Abstract

We consider the predictive problem of supervised ranking, where the task is to rank sets
of candidate items returned in response to queries. Although there exist statistical procedures
that come with guarantees of consistency in this setting, these procedures require that
individuals provide a complete ranking of all items, which is rarely feasible in practice. Instead,
individuals routinely provide partial preference information, such as pairwise comparisons of
items, and more practical approaches to ranking have aimed at modeling this partial preference
data directly. As we show, however, such an approach raises serious theoretical challenges.
Indeed, we demonstrate that many commonly used surrogate losses for pairwise comparison data
do not yield consistency; surprisingly, we show inconsistency even in low-noise settings. With
these negative results as motivation, we present a new approach to supervised ranking based on
aggregation of partial preferences and we develop U-statistic-based empirical risk minimization
procedures. We present an asymptotic analysis of these new procedures, showing that they yield 
consistency results that parallel those available for classification. We complement our theoretical 
results with an experiment studying the new procedures in a large-scale web-ranking task.

Recent years have seen significant developments in the theory of classification, most notably binary
classification, where strong theoretical results are available that quantify rates of convergence and
shed light on qualitative aspects of the problem [50, 3]. Extensions to multi-class classification have
also been explored, and connections to the theory of regression are increasingly well understood,
so that overall a satisfactory theory of supervised machine learning has begun to emerge [49, 43].
In many real-world problems in which labels or responses are available, however, the problem 
is not merely to classify or predict a real-valued response, but rather to list a set of items in order. 
The theory of supervised learning cannot be considered complete until it also provides a treat- 
ment of such ranking problems. For example, in information retrieval, the goal is to rank a set of 
documents in order of relevance to a user's search query; in medicine, the object is often that of 
ranking drugs in order of probable curative outcomes for a given disease; and in recommendation 
or advertising systems, the aim is to present a set of products in order of a customer's willing- 
ness to purchase or consume. In each example, the goal is to order a set of items in accordance 
with the preferences of an individual or population. While such problems are often converted to 
classification problems for simplicity (for example, a document is classified as "relevant" or not), 
decision makers frequently require the ranks (for example, a search engine must place documents 
in a particular ordering on the page). Despite its ubiquity, our statistical understanding of ranking
falls short of our understanding of classification and regression. Our aim here is to characterize 
the statistical behavior of computationally tractable inference procedures for ranking under natural 
data-generating mechanisms.

We consider a general decision-theoretic formulation of the supervised ranking problem in which
preference data are drawn i.i.d. from an unknown distribution, and where each datum consists
of a query, $Q \in \mathcal{Q}$, and a preference judgment, $Y \in \mathcal{Y}$, over a set $M_Q$ of candidate items that
are available based on the query Q. The exact nature of the query and preference judgment will 
depend on the ranking context. In the setting of information retrieval, for example, each datum 
corresponds to a user issuing a natural language query and expressing a preference by selecting or 
clicking on zero or more of the returned results. 

The statistical task is to discover a function that provides a query-specific ordering of items 
that best respects the observed preferences. This query-indexed setting is especially natural for 
tasks like information retrieval in which a different ranking of webpages is needed for each natural 
language query. 

Following existing literature, we estimate a scoring function $f : \mathcal{Q} \to \mathbb{R}^m$, where $f(q)$ assigns a
score to each of $m$ candidate items for the query $q$, and the results are ranked according to their
scores [24, 22]. Throughout the paper, we adopt a decision-theoretic perspective and assume that
given a query-judgment pair $(Q, Y)$, we evaluate the scoring function $f$ via a loss $L(f(Q), Y)$. The
goal is to choose the $f$ minimizing the risk

$$R(f) := \mathbb{E}[L(f(Q), Y)]. \tag{1}$$

While minimizing the risk (1) directly is in general intractable, researchers in machine learning and
information retrieval have developed surrogate loss functions that yield procedures for selecting $f$.
Unfortunately, as we show, extant procedures fail to solve the ranking problem under reasonable
data generating mechanisms. The goal in the remainder of the paper is to explain this failure and 
to propose a novel solution strategy based on preference aggregation. 

Let us begin to elucidate the shortcomings of current approaches to ranking. One main problem 
lies in their unrealistic assumptions about available data. The losses proposed and most commonly 
used for evaluation in the information retrieval literature [31, 27] have a common form, generally
referred to as (Normalized) Discounted Cumulative Gain ((N)DCG). The NDCG family requires
that the preference judgements $Y$ associated with the datum $(Q, Y)$ be a vector $Y \in \mathbb{R}^m$ of relevance
scores for the entire set of items; that is, $Y_j$ denotes the real-valued relevance of item $j$ to the
query Q. While having complete preference information makes it possible to design procedures 
that asymptotically minimize NDCG losses [e.g., [ljj, in practice such complete preferences are 
unrealistic: they are expensive to collect and difficult to trust. In biological applications, evaluating 
the effects of all drugs involved in a study — or all doses — on a single subject is infeasible. In web 
search, users click on only one or two results: no feedback is available for most items. Even when 
practical and ethical considerations do not rule out collecting complete preference information from 
participants in a study, a long line of psychological work has highlighted the inconsistency with
which humans assign numerical values to multiple objects [e.g., 42, 44].

The inherent practical difficulties that arise in using losses based on relevance scores have led
other researchers to propose loss functions that are suitable for partial preference data 2^, H, [ItJ . 
Such data arise naturally in a number of real-world situations; for example, a patient's prognosis 
may improve or deteriorate after administration of treatment, competitions and sporting matches 
provide paired results, and shoppers at a store purchase one item but not others. Moreover, the 
psychological literature shows that human beings are quite good at performing pairwise distinctions 
and forming relative judgments [see, e.g., 40, and references therein].

More formally, let $\alpha := f(Q) \in \mathbb{R}^m$ denote the vector of predicted scores for each item
associated with query $Q$. If a preference $Y$ indicates that item $i$ is preferred to $j$, then the natural
associated loss is the zero-one loss $L(\alpha, Y) = 1(\alpha_i < \alpha_j)$. Minimizing such a loss is well known to
be computationally intractable; nonetheless, the classification literature [1^, 0, 0, [43| has shown 
that it is possible to design convex Fisher-consistent surrogate losses for the 0-1 loss in
classification settings and has linked Fisher consistency to consistency. By reduction to classification,
similar consistency results are possible in certain bipartite or binary ranking scenarios [111 ]. One 
might therefore hope to make use of these surrogate losses in the ranking setting to obtain similar 
guarantees. Unfortunately, however, this hope is not borne out; as we illustrate in Section 3, it is
generally computationally intractable to minimize any Fisher-consistent loss for ranking, and even
in favorable low-noise cases, convex surrogates that yield Fisher consistency for binary classification
fail to be Fisher-consistent for ranking.

We find ourselves at an impasse: existing methods based on practical data-collection strategies 
do not yield a satisfactory theory, and those methods that do have theoretical justification are 
not practical. We address this difficulty with a new approach to supervised ranking
problems in which partial preference data are aggregated before being used for estimation. The
point of departure for this approach is the notion of rank aggregation [e.g., l2ll |. which has a long 
history in voting [16], social choice theory [12], and statistics [46, 30]. In Section 2, we discuss
some of the ways in which partial preference data can be aggregated, and we propose a new family
of U-statistic-based loss functions that are computationally tractable. Sections 3 and 4 present
a theoretical analysis of procedures based on these loss functions, establishing their consistency.
We provide a further discussion of practical rank aggregation strategies in Section 5 and present
experimental results in Section 6. Section 7 contains our conclusions, with proofs deferred to
appendices.



2 Ranking with rank aggregation 

We begin by considering several ways in which partial preference data arise in practice. We then 
turn to a formal treatment of our aggregation-based strategy for supervised ranking. 

1. Paired comparison data. Data in which an individual judges one item to be preferred over 
another in the context of a query are common. Competitions and sporting matches, where 
each pairwise comparison may be accompanied by a magnitude such as a difference of scores, 
naturally generate such data. In practice, a single individual will not provide feedback for 
all possible pairwise comparisons, and we do not assume transitivity among the observed 
preferences for an individual. Thus it is natural to model the pairwise preference judgment
space $\mathcal{Y}$ as the set of weighted directed graphs on $m$ nodes.

2. Selection data. A ubiquitous source of partial preference information is the selection behavior 
of a user presented with a small set of potentially ordered items. For example, in response 
to a search query, a web search engine presents an ordered list of webpages and records the 
URL a user clicks on, and a store records inventory and tracks the items customers purchase. 
Such selections provide partial information: that a user or customer prefers one item to others 
presented. 

3. Partial orders. An individual may also provide preference feedback in terms of a partial 
ordering over a set of candidates or items. In the context of elections, for example, each
preference judgment $Y \in \mathcal{Y}$ specifies a partial order $\prec_Y$ over candidates such that candidate
$i$ is preferred to candidate $j$ whenever $i \prec_Y j$. A partial order need not specify a preference
between every pair of items.

Using these examples as motivation, we wish to develop a formal treatment of ranking based on 
aggregation. To provide intuition for the framework presented in the remainder of this section, let 
us consider a simple aggregation strategy appropriate for the case of paired comparison data. Let 
each relevance judgment $Y \in \mathcal{Y}$ be a weighted adjacency matrix where the $(i, j)$th entry expresses
a preference for item $i$ over $j$ whenever this entry is non-zero. In this case, a natural aggregation
strategy is to average all observed adjacency matrices for a fixed query. Specifically, for a set
of adjacency matrices $\{Y_l\}_{l=1}^k$ representing user preferences for a given query, we form the average
$(1/k) \sum_{l=1}^k Y_l$. As $k \to \infty$, the average adjacency matrix captures the mean population preferences,
and we thereby obtain complete preference information over the $m$ items.
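
As a minimal sketch of this averaging strategy, the following pure-Python snippet forms the entrywise mean of $k$ observed adjacency matrices for a single query. The function name `average_adjacency` and the toy judgments are illustrative assumptions and do not come from the paper.

```python
def average_adjacency(matrices):
    """Entrywise average of a non-empty list of m x m weighted adjacency matrices."""
    k = len(matrices)
    m = len(matrices[0])
    return [[sum(Y[i][j] for Y in matrices) / k for j in range(m)]
            for i in range(m)]


# Three hypothetical judgments over m = 3 items for one query.
# Entry [i][j] > 0 means "item i is preferred to item j" with that weight.
judgments = [
    [[0, 1, 0], [0, 0, 0], [0, 0, 0]],  # user 1: item 0 over item 1
    [[0, 0, 0], [0, 0, 1], [0, 0, 0]],  # user 2: item 1 over item 2
    [[0, 1, 0], [0, 0, 0], [0, 0, 0]],  # user 3: item 0 over item 1
]
A = average_adjacency(judgments)
```

Each user supplies only one comparison, yet the averaged matrix `A` carries weight on both observed pairs, which is the sense in which aggregation fills in preference information.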

This averaging of partial preferences is one example of a general class of aggregation strategies 
that form the basis of our theoretical framework. To formalize this notion, we modify the loss 
formulation slightly and hereafter assume that the loss function $L$ is a mapping $\mathbb{R}^m \times \mathcal{S} \to \mathbb{R}$, where
$\mathcal{S}$ is a problem-specific structure space. We further assume the existence of a series of structure
functions, $s_k : \mathcal{Y}^k \to \mathcal{S}$, that map sets of preference judgments $\{Y_j\}$ into $\mathcal{S}$. The loss $L$ depends on
the preference feedback $(Y_1, \ldots, Y_k)$ for a given query only via the structure $s_k(Y_1, \ldots, Y_k)$. In the
example of the previous paragraph, $\mathcal{S}$ is the set of $m \times m$ adjacency matrices, and
$s_k(Y_1, \ldots, Y_k) = (1/k) \sum_{l=1}^k Y_l$. A typical loss for this setting is the pairwise loss [22, 28]

$$L(\alpha, s(Y_1, \ldots, Y_k)) = L(\alpha, A) = \sum_{i<j} A_{ij} 1(\alpha_i \le \alpha_j) + \sum_{i>j} A_{ij} 1(\alpha_i < \alpha_j),$$

where $\alpha$ is a set of scores, and $A = s_k(Y_1, \ldots, Y_k)$ is the average adjacency matrix with entries
$A_{ij}$. In Section 5, we provide other examples of structure functions for different data collection
mechanisms and losses. Hereafter, we abbreviate $s_k(Y_1, \ldots, Y_k)$ as $s(Y_1, \ldots, Y_k)$ whenever the input
length $k$ is clear from context.
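
The pairwise loss on an aggregated matrix can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the name `pairwise_loss` is hypothetical, and the tie convention (a non-strict inequality in the $i < j$ sum, a strict one in the $i > j$ sum) is assumed so that a tie is charged through only one of the two sums.

```python
def pairwise_loss(alpha, A):
    """Pairwise loss of scores alpha against an aggregated adjacency matrix A.

    A[i][j] is the weight of the preference "item i over item j".
    Assumed tie convention: a tie alpha[i] == alpha[j] is charged only
    through the i < j term.
    """
    m = len(alpha)
    loss = 0.0
    for i in range(m):
        for j in range(m):
            if i < j and alpha[i] <= alpha[j]:
                loss += A[i][j]
            elif i > j and alpha[i] < alpha[j]:
                loss += A[i][j]
    return loss


# Hypothetical averaged preferences: item 0 over 1 (2/3), item 1 over 2 (1/3).
A = [[0.0, 2.0 / 3.0, 0.0],
     [0.0, 0.0, 1.0 / 3.0],
     [0.0, 0.0, 0.0]]

agree = pairwise_loss([3.0, 2.0, 1.0], A)     # scores respect every preference
disagree = pairwise_loss([1.0, 2.0, 3.0], A)  # scores reverse every preference
```

Scores that respect every aggregated preference incur no loss, while reversing the order pays the full weight of each violated entry.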

To meaningfully characterize the asymptotics of inference procedures, we make a mild assump- 
tion on the limiting behavior of the structure functions. 

Assumption A. Fix a query $Q = q$. Let the sequence $Y_1, Y_2, \ldots$ be drawn i.i.d. conditional on $q$
and define the random variables $S_k := s(Y_1, \ldots, Y_k)$. If $\mu_q^k$ denotes the distribution of $S_k$, there
exists a limiting law $\mu_q$ such that

$$\mu_q^k \xrightarrow{d} \mu_q \quad \text{as } k \to \infty.$$

For example, the averaging structure function satisfies Assumption A so long as $\mathbb{E}[|Y_{ij}| \mid Q] < \infty$
with probability 1. Aside from the requirements of Assumption A, we allow arbitrary aggregation
within the structure function.

In addition, our main assumption on the loss function $L$ is as follows:

Assumption B. The loss function $L : \mathbb{R}^m \times \mathcal{S} \to \mathbb{R}$ is bounded in $[0, 1]$, and, for any fixed vector
$\alpha \in \mathbb{R}^m$, $L(\alpha, \cdot)$ is continuous in the topology of $\mathcal{S}$.

Having stated assumptions on the asymptotics of the structure function s and the loss L, we 
can turn to a discussion of the risk functions that guide our design of inference procedures. We 
begin with the pointwise conditional risk, which maps predicted scores and a measure $\mu$ on $\mathcal{S}$ to
$[0, 1]$:

$$\ell : \mathbb{R}^m \times \mathcal{M}(\mathcal{S}) \to [0, 1] \quad \text{where} \quad \ell(\alpha, \mu) = \int L(\alpha, s) \, d\mu(s). \tag{2}$$

Here $\mathcal{M}(\mathcal{S})$ denotes the closure of the subset of probability measures on the set $\mathcal{S}$ for which $\ell$
is defined. For any query $q$ and $\alpha \in \mathbb{R}^m$, we have $\lim_k \ell(\alpha, \mu_q^k) = \ell(\alpha, \mu_q)$ by the definition of
convergence in distribution. This convergence motivates our decision-theoretic approach.
Our goal in ranking is thus to minimize the risk

$$R(f) := \sum_q p_q \ell(f(q), \mu_q), \tag{3}$$

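where $p_q$ denotes the probability that the query $Q = q$ is issued. The risk of the scoring function
$f$ can also be obtained in the limit as the number of preference judgments for each query goes to
infinity:

$$R(f) = \lim_k \mathbb{E}[L(f(Q), s(Y_1, \ldots, Y_k))] = \lim_k \sum_q p_q \ell(f(q), \mu_q^k). \tag{4}$$

That the limiting expectation (4) is equal to the risk (3) follows from the following argument: since
$\sum_q p_q = 1$, for any $\epsilon > 0$ we can choose a set $\mathcal{Q}(\epsilon)$ such that $\sum_{q \in \mathcal{Q}(\epsilon)} p_q < \epsilon$ and $\mathcal{Q}(\epsilon)^c$ is finite.
For each $q \in \mathcal{Q}(\epsilon)^c$, there exists some $k(q) \in \mathbb{N}$ such that $|\ell(f(q), \mu_q^k) - \ell(f(q), \mu_q)| < \epsilon$ for $k \ge k(q)$;
define $K$ to be the maximum of such $k(q)$. Then for $k \ge K$,

$$\Big| \sum_q p_q \big[ \ell(f(q), \mu_q^k) - \ell(f(q), \mu_q) \big] \Big| \le \sum_{q \in \mathcal{Q}(\epsilon)} p_q + \sum_{q \in \mathcal{Q}(\epsilon)^c} p_q \big| \ell(f(q), \mu_q^k) - \ell(f(q), \mu_q) \big| \le \epsilon + \epsilon \sum_{q \in \mathcal{Q}(\epsilon)^c} p_q \le 2\epsilon.$$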

We face two main difficulties in the study of the minimization of the risk (3). The first difficulty
is that of Fisher consistency mentioned previously: since $L$ may be non-smooth in the function $f$
and is typically intractable to minimize, when will the minimization of a tractable surrogate lead to
the minimization of the loss (3)? We provide a precise formulation of and answer to this question in
Section 3. In addition, we demonstrate the inconsistency of many commonly used pairwise ranking
surrogates and show that aggregation leads to tractable Fisher-consistent inference procedures for
both complete and partial data losses.

The second difficulty is that of consistency: for a given Fisher-consistent surrogate for the
risk (3), are there tractable statistical procedures that converge to a minimizer of the risk? Yes: in
Section 4, we develop a new family of aggregation losses based on U-statistics of increasing order,
showing that uniform laws of large numbers hold for the resulting M-estimators.



3 Fisher consistency of surrogate risk minimization 

In this section, we formally define the Fisher consistency of a surrogate loss and give general
necessary and sufficient conditions for consistency to hold for losses satisfying Assumption B. To
begin, we assume that the space $\mathcal{Q}$ of queries is countable (or finite) and thus in bijection with a
subset of $\mathbb{N}$. Recalling the definition (3) of the risk and the pointwise conditional risk (2), we define
the Bayes risk for $R$ as the minimal risk over all measurable functions $f : \mathcal{Q} \to \mathbb{R}^m$:

$$R^* := \inf_f R(f) = \sum_q p_q \inf_\alpha \ell(\alpha, \mu_q).$$

The second equality follows because $\mathcal{Q}$ is countable and the infimum is taken over all measurable
functions.

Since it is infeasible to minimize the risk (3) directly, we consider a bounded-below surrogate
$\varphi$ to minimize in place of $L$. For each structure $s \in \mathcal{S}$, we write $\varphi(\cdot, s) : \mathbb{R}^m \to \mathbb{R}_+$, and we assume
that for $\alpha \in \mathbb{R}^m$, the function $s \mapsto \varphi(\alpha, s)$ is continuous with respect to the topology on $\mathcal{S}$. We
then define the conditional $\varphi$-risk as

$$\ell_\varphi(\alpha, \mu) := \int \varphi(\alpha, s) \, d\mu(s) \tag{5}$$

and the asymptotic $\varphi$-risk of the function $f$ as

$$R_\varphi(f) := \sum_q p_q \ell_\varphi(f(q), \mu_q), \tag{6}$$

whenever each $\ell_\varphi(f(q), \mu_q)$ exists (otherwise $R_\varphi(f) = +\infty$). The optimal $\varphi$-risk is defined to be
$R_\varphi^* := \inf_f R_\varphi(f)$, and throughout we make the assumption that there exist measurable $f$ such
that $R_\varphi(f) < +\infty$ so that $R_\varphi^*$ is finite. The following is our general notion of consistency.

Definition 1. The surrogate loss $\varphi$ is Fisher-consistent for the loss $L$ if for any $\{p_q\}$ and probability
measures $\mu_q \in \mathcal{M}(\mathcal{S})$, the convergence

$$R_\varphi(f_n) \to R_\varphi^* \quad \text{implies} \quad R(f_n) \to R^*.$$

To achieve more actionable risk bounds and to more accurately compare surrogate risks, we
also draw upon a uniform statement of consistency:

Definition 2. The surrogate loss $\varphi$ is uniformly consistent for the loss $L$ if for any $\epsilon > 0$, there
exists a $\delta(\epsilon) > 0$ such that for any $\{p_q\}$ and probability measures $\mu_q \in \mathcal{M}(\mathcal{S})$,

$$R_\varphi(f) < R_\varphi^* + \delta(\epsilon) \quad \text{implies} \quad R(f) < R^* + \epsilon. \tag{7}$$

The bound (7) is equivalent to the assertion that there exists a non-decreasing function $\zeta$ such that
$\zeta(0) = 0$ and $R(f) - R^* \le \zeta(R_\varphi(f) - R_\varphi^*)$. Bounds of this form have been completely characterized in
the case of binary classification, and Steinwart [43] has given necessary and sufficient conditions
for uniform consistency to hold. We now turn to analyzing conditions under which a surrogate loss
$\varphi$ is consistent for ranking.



3.1 General theory 

The main approach in establishing conditions for the surrogate risk consistency in Definition 1
is to move from global conditions for consistency to local, pointwise consistency. Following the
treatment of Steinwart [43], we begin by defining a function measuring the discriminating ability
of the surrogate $\varphi$:

$$H(\epsilon) := \inf_{\mu \in \mathcal{M}(\mathcal{S}), \alpha} \Big\{ \ell_\varphi(\alpha, \mu) - \inf_{\alpha'} \ell_\varphi(\alpha', \mu) \;\Big|\; \ell(\alpha, \mu) - \inf_{\alpha'} \ell(\alpha', \mu) \ge \epsilon \Big\}. \tag{8}$$

This function is familiar from work on surrogate risk consistency in classification 0] and measures 
surrogate risk suboptimality as a function of risk suboptimality. A reasonable conditional $\varphi$-risk
will declare a set of scores $\alpha \in \mathbb{R}^m$ suboptimal whenever the conditional risk $\ell$ declares them
suboptimal. This corresponds to $H(\epsilon) > 0$ whenever $\epsilon > 0$, and we call any loss satisfying this
condition pointwise consistent.

From these definitions, we can conclude the following consistency result, which is analogous to
the results of [43]. For completeness, we provide a proof in supplementary Appendix A.1.

Proposition 1. Let $\varphi : \mathbb{R}^m \times \mathcal{S} \to \mathbb{R}_+$ be a bounded-below loss function such that for some $f$,
$R_\varphi(f) < +\infty$. Then $\varphi$ is pointwise consistent if and only if the uniform consistency definition (7)
holds.

Proposition 1 makes it clear that pointwise consistency for general measures $\mu$ on the set of
structures $\mathcal{S}$ is a stronger condition than that of consistency in Definition 1. In some situations,
however, it is possible to connect the weaker surrogate risk consistency of Definition 1 with uniform
consistency and pointwise consistency. Ranking problems with appropriate choices of the space $\mathcal{S}$
give rise to such connections. Indeed, consider the following:

Assumption C. The space of possible structures $\mathcal{S}$ is finite, and the loss $L$ is discrete, meaning
that it takes on only finitely many values.

Binary and multiclass classification provide examples of settings in which Assumption C is
appropriate, since the set of structures $\mathcal{S}$ is the set of class labels, and $L$ is usually a version of the 0-1
loss. We also sometimes make a weaker version of Assumption C:

Assumption C'. The (topological) space of possible structures $\mathcal{S}$ is compact, and there exists a
partition $A_1, \ldots, A_d$ of $\mathbb{R}^m$ such that for any $s \in \mathcal{S}$,

$$L(\alpha, s) = L(\alpha', s) \quad \text{whenever } \alpha, \alpha' \in A_i.$$

Assumption C' may be more natural in ranking settings than Assumption C. The compactness
assumption holds, for example, if $\mathcal{S} \subset \mathbb{R}^m$ and is closed and bounded, such as in our pairwise
aggregation example in Section 2. Losses $L$ that depend only on the relative order of the coordinate
values of $\alpha \in \mathbb{R}^m$, which are common in ranking problems, provide a collection of examples for which the
partitioning condition holds.

Under Assumption C or C', we can provide a definition of local consistency that is often more
user-friendly than pointwise consistency (8):

Definition 3. Let $\varphi$ be a bounded-below surrogate loss such that $\varphi(\cdot, s)$ is continuous for all $s \in \mathcal{S}$.
The function $\varphi$ is structure-consistent with respect to the loss $L$ if for all $\mu \in \mathcal{M}(\mathcal{S})$,

$$\ell_\varphi^*(\mu) := \inf_\alpha \ell_\varphi(\alpha, \mu) < \inf_\alpha \Big\{ \ell_\varphi(\alpha, \mu) \;\Big|\; \alpha \notin \operatorname{argmin}_{\alpha'} \ell(\alpha', \mu) \Big\}.$$

Definition 3 describes the set of loss functions $\varphi$ satisfying the intuitively desirable property that the
surrogate $\varphi$ cannot be minimized if the scores $\alpha \in \mathbb{R}^m$ are restricted to not minimize the loss $L$. As
we see presently, Definition 3 captures exactly what it means for a surrogate loss $\varphi$ to be consistent
when one of Assumptions C or C' holds. Moreover, the set of consistent surrogates coincides with
the set of uniformly consistent surrogates in this case. The following theorem formally states this
result; we give a proof in supplementary Appendix A.2.

Theorem 1. Let $\varphi : \mathbb{R}^m \times \mathcal{S} \to \mathbb{R}_+$ satisfy $R_\varphi(f) < +\infty$ for some measurable $f$. If Assumption C
holds, then

(a) If $\varphi$ is structure-consistent (Definition 3), then $\varphi$ is uniformly consistent for the loss $L$
(Definition 2).

(b) If $\varphi$ is Fisher-consistent for the loss $L$ (Definition 1), then $\varphi$ is structure-consistent.

If the function $\varphi(\cdot, s)$ is convex for $s \in \mathcal{S}$, and for $\mu \in \mathcal{M}(\mathcal{S})$ the conditional risk $\ell_\varphi(\alpha, \mu) \to \infty$ as
$\|\alpha\| \to \infty$, then Assumption C' implies (a) and (b).

Theorem 1 shows that as long as Assumption C holds, pointwise consistency, structure
consistency, and both uniform and non-uniform surrogate loss consistency coincide. These four also
coincide under the weaker Assumption C' so long as the surrogate is 0-coercive, which is not
restrictive in practice. As a final note, we recall a result due to Steinwart [43], which gives general
necessary and sufficient conditions for the consistency in Definition 1 to hold. We begin by giving
a weaker version of the suboptimality function (8) that depends on $\mu$:

$$H(\epsilon, \mu) := \inf_\alpha \Big\{ \ell_\varphi(\alpha, \mu) - \inf_{\alpha'} \ell_\varphi(\alpha', \mu) \;\Big|\; \ell(\alpha, \mu) - \inf_{\alpha'} \ell(\alpha', \mu) \ge \epsilon \Big\}. \tag{9}$$

Proposition 2 (Steinwart [43], Theorems 2.8 and 3.3). The suboptimality function (9) satisfies
$H(\epsilon, \mu_q) > 0$ for any $\epsilon > 0$ and $\mu_q$ with $q \in \mathcal{Q}$ and $p_q > 0$ if and only if $\varphi$ is Fisher-consistent for
the loss $L$ (Definition 1).

We remark that as a corollary of this result, any structure-consistent surrogate loss $\varphi$ (in the sense
of Definition 3) is consistent for the loss $L$ whenever the conditional risk $\ell(\alpha, \mu)$ has finite range, so
that $\alpha \notin \operatorname{argmin}_{\alpha'} \ell(\alpha', \mu)$ implies the existence of an $\epsilon > 0$ such that $\ell(\alpha, \mu) - \inf_{\alpha'} \ell(\alpha', \mu) \ge \epsilon$.



3.2 The difficulty of consistency for ranking 

We now turn to the question of whether there exist structure-consistent ranking losses. In a
preliminary version of this work [19], we focused on the practical setting of learning from pairwise
preference data and demonstrated that many popular ranking surrogates are inconsistent for
standard pairwise ranking losses. We review and generalize our main inconsistency results here, noting
that while the losses considered use pairwise preferences, they perform no aggregation. Their
theoretically poor performance provides motivation for the aggregation strategies proposed in this work;
we explore the connections in Section 5 (focusing on pairwise losses in Section 5.3). We provide
proofs of our inconsistency results in supplementary Appendix C.

To place ourselves in the general structural setting of the paper, we consider the structure
function $s(Y_1, \ldots, Y_k) = Y_1$, which performs no aggregation for all $k$, and we let $Y$ denote the
weighted adjacency matrix of a directed acyclic graph (DAG) $G$, so that $Y_{ij}$ is the weight of the
directed edge $(i \to j)$ in the graph $G$. We consider a pairwise loss that imposes a separate penalty
for each misordered pair of results:

$$L(\alpha, Y) = \sum_{i<j} Y_{ij} 1(\alpha_i \le \alpha_j) + \sum_{i>j} Y_{ij} 1(\alpha_i < \alpha_j), \tag{10}$$

where the cases $i < j$ and $i > j$ are distinguished to avoid doubly penalizing $1(\alpha_i = \alpha_j)$. When
pairwise preference judgments are available, use of such losses is common. Indeed, this loss
generalizes the disagreement error described by Dekel et al. [13] and is similar to losses used by Joachims
[28]. If we define $Y_{ij}^\mu := \int Y_{ij} \, d\mu(Y)$, then

$$\ell(\alpha, \mu) = \sum_{i<j} Y_{ij}^\mu 1(\alpha_i \le \alpha_j) + \sum_{i>j} Y_{ij}^\mu 1(\alpha_i < \alpha_j). \tag{11}$$

We assume that the number of nodes in any graph $G$ (or, equivalently, the number of results
returned by any query) is bounded by a finite constant $M$. Hence, the conditional risk (11) has
a finite range; if there are a finite number of preference labels $Y$ or the set of weights is compact,
Assumptions C or C' are satisfied, whence Theorem 1 applies.
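
For readers who want the intermediate step, the conditional risk (11) follows from the pairwise loss (10) by linearity of the integral, since the indicator terms depend only on the scores and not on $Y$. The display below is a sketch of that step, assuming the tie convention in which the $i < j$ sum uses a non-strict inequality and the $i > j$ sum a strict one.

```latex
\begin{align*}
\ell(\alpha, \mu)
  &= \int L(\alpha, Y) \, d\mu(Y) \\
  &= \sum_{i<j} \Big( \int Y_{ij} \, d\mu(Y) \Big) 1(\alpha_i \le \alpha_j)
   + \sum_{i>j} \Big( \int Y_{ij} \, d\mu(Y) \Big) 1(\alpha_i < \alpha_j).
\end{align*}
```

Each parenthesized integral is the mean edge weight for the pair $(i, j)$ under $\mu$, so the conditional risk depends on $\mu$ only through these mean weights.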



3.2.1 General inconsistency 

Let the set P denote the complexity class of problems solvable in polynomial time and NP denote
the class of non-deterministic polynomial time problems [see, e.g., 29]. Our first inconsistency result
(see also [19, Lemma 7]) is that unless P = NP (a widely doubted proposition), any loss that is
tractable to minimize cannot be a consistent surrogate for the loss (10) and its associated risk.

Proposition 3. Finding an $\alpha$ minimizing $\ell$ is NP-hard.

In particular, most convex functions are minimizable to an accuracy of $\epsilon$ in time polynomial in
the dimension of the problem times a multiple of $\log(1/\epsilon)$, known as poly-logarithmic time Qj. Since
any $\alpha$ minimizing $\ell_\varphi(\alpha, \mu)$ must minimize $\ell(\alpha, \mu)$ for a consistent surrogate $\varphi$, and $\ell(\cdot, \mu)$ has a
finite range (so that optimizing $\ell_\varphi$ to a fixed $\epsilon$ accuracy is sufficient), convex surrogate losses are
inconsistent for the pairwise loss (10) unless P = NP.
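
To make the combinatorial nature of the problem concrete, here is a minimal sketch of exact minimization of the conditional pairwise risk by exhaustive search over all $m!$ orderings. It is a brute-force illustration only, not a method proposed in the paper; the names `conditional_risk` and `brute_force_minimizer`, the toy matrix, and the tie convention (non-strict inequality for $i < j$) are assumptions.

```python
from itertools import permutations


def conditional_risk(alpha, Ybar):
    """Pairwise conditional risk for scores alpha and mean edge weights Ybar."""
    m = len(alpha)
    risk = 0.0
    for i in range(m):
        for j in range(m):
            if i < j and alpha[i] <= alpha[j]:
                risk += Ybar[i][j]
            elif i > j and alpha[i] < alpha[j]:
                risk += Ybar[i][j]
    return risk


def brute_force_minimizer(Ybar):
    """Try every strict ordering of the m items; cost grows like m!."""
    m = len(Ybar)
    best_alpha, best_risk = None, float("inf")
    for perm in permutations(range(m)):
        # perm[r] is the item placed at position r; earlier positions get higher scores.
        alpha = [0.0] * m
        for position, item in enumerate(perm):
            alpha[item] = float(m - position)
        risk = conditional_risk(alpha, Ybar)
        if risk < best_risk:
            best_alpha, best_risk = alpha, risk
    return best_alpha, best_risk


# Hypothetical cyclic mean preferences: 0 over 1 (0.6), 1 over 2 (0.5), 2 over 0 (0.4).
Ybar = [[0.0, 0.6, 0.0],
        [0.0, 0.0, 0.5],
        [0.4, 0.0, 0.0]]
best_alpha, best_risk = brute_force_minimizer(Ybar)
```

With cyclic preferences every ordering violates at least one edge, and the search keeps the ordering that sacrifices the lightest one. The factorial number of candidates is what makes this approach impractical beyond a handful of items.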



3.2.2 Low-noise inconsistency 

We now turn to showing that, surprisingly, many common convex surrogates are inconsistent even
in low-noise settings in which it is easy to find an $\alpha$ minimizing $\ell(\alpha, \mu)$. (Weaker versions of the
results in this section appeared in our preliminary paper [19].) Inspecting the loss definition (10),
a natural choice for a surrogate loss is one of the form [24, 22, 13]

$$\varphi(\alpha, Y) = \sum_{i,j} h(Y_{ij}) \phi(\alpha_i - \alpha_j), \tag{12}$$

where $\phi \ge 0$ is a convex function, and $h$ is some function of the penalties $Y_{ij}$. This surrogate
implicitly uses the structure function $s(Y_1, \ldots, Y_k) = Y_1$ and performs no preference aggregation.
The conditional surrogate risk is thus $\ell_\varphi(\alpha, \mu) = \sum_{i \ne j} h_{ij} \phi(\alpha_i - \alpha_j)$, where $h_{ij} := \int h(Y_{ij}) \, d\mu(Y)$.
Surrogates of the form (12) are convenient in margin-based binary classification, where the complete
description by Bartlett, Jordan, and McAuliffe [3] shows $\phi$ is Fisher-consistent if and only if it is
differentiable at 0 with $\phi'(0) < 0$.
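
A surrogate of this weighted pairwise form can be sketched in a few lines. The snippet below is illustrative only: the helper names, the choice of the hinge and logistic functions as examples of the convex function, the identity default for `h`, and the finite-difference check of the derivative at zero are all assumptions made for the example.

```python
import math


def logistic(t):
    """Logistic loss log(1 + exp(-t)); convex and differentiable everywhere."""
    return math.log(1.0 + math.exp(-t))


def hinge(t):
    """Hinge loss max(0, 1 - t); convex, with its kink at t = 1 rather than 0."""
    return max(0.0, 1.0 - t)


def pairwise_surrogate(alpha, Y, phi, h=lambda y: y):
    """Sum over all (i, j) of h(Y[i][j]) * phi(alpha[i] - alpha[j])."""
    m = len(alpha)
    return sum(h(Y[i][j]) * phi(alpha[i] - alpha[j])
               for i in range(m) for j in range(m))


def derivative_at_zero(phi, eps=1e-6):
    """Central finite-difference estimate of phi'(0)."""
    return (phi(eps) - phi(-eps)) / (2.0 * eps)


# One hypothetical judgment over two items: item 0 preferred to item 1.
Y = [[0, 1], [0, 0]]
separated = pairwise_surrogate([1.0, 0.0], Y, hinge)  # margin of 1 is met
tied = pairwise_surrogate([0.0, 0.0], Y, hinge)       # no margin at all
```

Both example functions have a negative derivative at zero, so they satisfy the binary-classification criterion quoted above; the results of this section concern whether that property carries over to ranking.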

We now precisely define our low-noise setting. For any measure $\mu$ on a space $\mathcal{Y}$ of adjacency
matrices, let the directed graph $G_\mu$ be the difference graph, that is, the graph with edge weights
$\max\{Y_{ij}^\mu - Y_{ji}^\mu, 0\}$ on edges $(i \to j)$, where $Y_{ij}^\mu = \int Y_{ij} \, d\mu(Y)$. Then we say that the edge $(i \to j) \in G_\mu$
if $Y_{ij}^\mu > Y_{ji}^\mu$ (see Figure 1). We define the following low-noise condition based on self-reinforcement
of edges in the difference graph.

Definition 4. The measure $\mu$ on a set $\mathcal{Y}$ of adjacency matrices is low-noise when the corresponding
difference graph satisfies the following reverse triangle inequality: whenever there is an edge
$(i \to j)$ and an edge $(j \to k)$ in $G_\mu$, then the weight $Y_{ik}^\mu - Y_{ki}^\mu$ on the edge $(i \to k)$ is greater than
or equal to the path weight $Y_{ij}^\mu - Y_{ji}^\mu + Y_{jk}^\mu - Y_{kj}^\mu$ on the path $(i \to j \to k)$.

Figure 1. The two leftmost DAGs each occur with probability 1/2, yielding the difference graph $G_\mu$ at
right, assuming $Y_{23} > Y_{32}$.
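
The low-noise condition can be checked mechanically on a matrix of mean edge weights. The following is a minimal sketch under stated assumptions: the function names and the two toy matrices are hypothetical, and the check simply enumerates all ordered triples of distinct items.

```python
def difference_graph(Ybar):
    """Edge weights max(Ybar[i][j] - Ybar[j][i], 0) of the difference graph."""
    m = len(Ybar)
    return [[max(Ybar[i][j] - Ybar[j][i], 0.0) for j in range(m)]
            for i in range(m)]


def is_low_noise(Ybar):
    """Check the reverse triangle inequality on every path i -> j -> k."""
    m = len(Ybar)
    D = difference_graph(Ybar)
    for i in range(m):
        for j in range(m):
            for k in range(m):
                if len({i, j, k}) < 3:
                    continue  # only triples of distinct items form a path
                if D[i][j] > 0 and D[j][k] > 0:
                    edge_weight = Ybar[i][k] - Ybar[k][i]
                    path_weight = ((Ybar[i][j] - Ybar[j][i])
                                   + (Ybar[j][k] - Ybar[k][j]))
                    if edge_weight < path_weight:
                        return False
    return True


# Hypothetical mean weights: the direct edge 0 -> 2 reinforces the path 0 -> 1 -> 2.
reinforcing = [[0, 1, 2], [0, 0, 1], [0, 0, 0]]
# Same path, but the direct edge 0 -> 2 is weaker than the path weight.
weak_shortcut = [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
```

In the first matrix the direct edge weight of 2 matches the path weight of 1 + 1, so the condition holds; in the second the direct edge has weight 1 and the condition fails.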

If $\mu$ satisfies Definition 4, its difference graph is a DAG. Indeed, the definition ensures that
all global preference information in $G_\mu$ (the sum of weights along any path) conforms with and
reinforces local preference information (the weight on a single edge). Hence, we would expect any
reasonable ranking method to be consistent in this setting. Nevertheless, typical pairwise surrogate
losses are inconsistent in this low-noise setting (see also the weaker Theorem 11 in our preliminary
work [19]).

Theorem 2. Let $\varphi$ be a loss of the form (12) and assume $h(0) = 0$. If $\phi$ is convex, then even in
the low-noise setting of Definition 4 the loss $\varphi$ is not structure-consistent.

Given the difficulties we encounter using losses of the form (12), it is reasonable to consider
a reformulation of the surrogate. A natural alternative is a margin-based loss, which encodes a
desire to separate ranking scores by large margins dependent on the preferences in a graph. Similar
losses have been proposed, e.g., by Shashua and Levin [41]. The next result shows that convex
margin-based losses are also inconsistent, even in low-noise settings. (See also the weaker Theorem
12 of our preliminary work [19].)

Theorem 3. Let $\varphi$ be a loss of the form

$$\varphi(\alpha, Y) = \sum_{i,j : Y_{ij} > 0} \phi(\alpha_i - \alpha_j - h(Y_{ij})), \tag{13}$$

where $h : \mathbb{R} \to \mathbb{R}$. If $\phi$ is convex, then even in the low-noise setting of Definition 4 the loss $\varphi$ is
not structure-consistent.



3.3 Achieving consistency 

Although Section 3.2 suggests an inherent difficulty in the development of tractable losses for
ranking, tractable consistency is in fact achievable if one has access to complete preference data.
We review a few of the known results here, showing how they follow from the consistency guarantees
in Section 3.1, and derive some new consistency guarantees for the complete data setting (we defer
all proofs to supplementary Appendix A). As we have argued, these results are of limited practical
value per se, since complete preference judgements are typically unavailable or untrustworthy, but,
as we show in Sections 4 and 5, they can be combined with aggregation strategies to yield procedures
that are both practical and come with consistency guarantees. 
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We first define the Normalized Discounted Cumulative Gain (NDCG) family of complete data 
losses. Such losses are common in applications like web search, since they penalize ranking errors 
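at the top of a ranked list more heavily than errors farther down the list. Let $s \in \mathcal{S} \subseteq \mathbb{R}^m$ be
a vector of relevance scores and $\alpha \in \mathbb{R}^m$ be a vector of predicted scores. Define $\pi_\alpha$ to be the
permutation associated with $\alpha$, so that $\pi_\alpha(j)$ is the rank of item $j$ in the ordering induced by $\alpha$.
Following Ravikumar et al. [37], a general class of NDCG loss functions can be defined as follows:

\[ L(\alpha, s) = 1 - \frac{1}{Z(s)} \sum_{j=1}^m \frac{G(s_j)}{F(\pi_\alpha(j))}, \qquad Z(s) = \max_{\alpha'} \sum_{j=1}^m \frac{G(s_j)}{F(\pi_{\alpha'}(j))}, \tag{14} \]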

where $G$ and $F$ are functions monotonically increasing in their arguments. By inspection, $L \in [0, 1]$,
and we remark that the standard NDCG criterion [27] uses $G(s_j) = 2^{s_j} - 1$ and $F(j) = \log(1 + j)$.
The "precision at $k$" loss [31] can also be written in the form (14), where $G(s_j) = s_j$ (assuming
that $s_j \ge 0$) and $F(j) = 1$ for $j \le k$ and $F(j) = +\infty$ otherwise, which measures the relevance of
the top $k$ items given by the vector $\alpha$. This form generalizes standard forms of precision, which
assume $s_j \in \{0, 1\}$.
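
The NDCG-type loss (14) is straightforward to evaluate numerically. The following is a minimal sketch, not the authors' implementation: the function name `ndcg_loss`, the tie-breaking rule (ties in the predicted scores are broken by item index), and the convention of returning zero loss when every gain is zero are all assumptions made for illustration. It uses the fact that, for increasing positive $F$ and non-negative gains, $Z(s)$ is attained by placing the largest gains first.

```python
import math


def ndcg_loss(alpha, s, G=lambda r: 2.0 ** r - 1.0, F=lambda j: math.log(1.0 + j)):
    """Loss of the form (14): 1 - (1/Z(s)) * sum_j G(s_j) / F(rank of item j under alpha)."""
    m = len(s)
    # Items sorted by decreasing predicted score; ties broken by index (an assumption).
    order = sorted(range(m), key=lambda j: -alpha[j])
    rank = [0] * m
    for position, j in enumerate(order, start=1):
        rank[j] = position  # 1-based rank of item j
    gains = [G(sj) for sj in s]
    # Z(s): best achievable discounted gain, obtained by ranking the largest gains first.
    Z = sum(g / F(pos) for pos, g in enumerate(sorted(gains, reverse=True), start=1))
    if Z == 0.0:
        return 0.0  # assumption: an all-zero gain vector incurs no loss
    return 1.0 - sum(gains[j] / F(rank[j]) for j in range(m)) / Z


def precision_at_k_loss(alpha, s, k):
    """Precision-at-k written in the form (14): G(s_j) = s_j, F(j) = 1 for j <= k, +inf otherwise."""
    return ndcg_loss(alpha, s, G=lambda r: float(r),
                     F=lambda j: 1.0 if j <= k else math.inf)
```

For instance, with relevance scores `[2, 1, 0]`, predicted scores that sort the items in the same order incur zero loss, while any other ordering incurs a strictly positive loss.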

To analyze the consistency of surrogate losses for the NDCG family (14), we begin by computing
the loss $\ell(\alpha, \mu)$, then state a corollary to Proposition 2. First, we observe that for any $\mu \in \mathcal{M}(\mathcal{S})$,

\[ \ell(\alpha, \mu) = 1 - \sum_{j=1}^m \frac{1}{F(\pi_\alpha(j))} \int \frac{G(s_j)}{Z(s)} \, d\mu(s). \]

Since the function $F$ is increasing in its argument, minimizing $\ell(\alpha, \mu)$ corresponds to choosing any
vector $\alpha$ whose values $\alpha_j$ obey the same order as the $m$ points $\int G(s_j)/Z(s) \, d\mu(s)$. In particular,
the range of $\ell$ is finite for any $\mu$, since it depends only on the permutation induced by $\alpha$, so we have

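Corollary 1. Define the set

\[ A(\mu) = \Bigl\{ \alpha \in \mathbb{R}^m \;\Big|\; \alpha_j > \alpha_l \text{ when } \int \frac{G(s_j)}{Z(s)} \, d\mu(s) > \int \frac{G(s_l)}{Z(s)} \, d\mu(s) \Bigr\}. \tag{15} \]

A surrogate loss $\varphi$ is Fisher-consistent for the NDCG family (14) if and only if for all $\mu \in \mathcal{M}(\mathcal{S})$,

\[ \inf_{\alpha} \Bigl\{ \ell_\varphi(\alpha, \mu) - \inf_{\alpha'} \ell_\varphi(\alpha', \mu) \;\Big|\; \alpha \notin A(\mu) \Bigr\} > 0. \]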



Corollary 1 recovers the main flavor of the consistency results in the papers of Ravikumar et al.
[37] and Buffoni et al. The surrogate $\varphi$ is consistent if and only if it preserves the order of
the integrated terms $\int G(s_j)/Z(s) \, d\mu(s)$, that is, any sequence $\alpha_n$ tending toward the infimum of
$\ell_\varphi(\alpha, \mu)$ must satisfy $\alpha_n \in A(\mu)$ for large enough $n$. Zhang [49] presents several examples of such
losses; as a corollary to his Theorem 5 [also noted by0], the loss 



\[ \varphi(\alpha, s) := \sum_{j=1}^m \frac{G(s_j)}{Z(s)} \sum_{l=1}^m \phi(\alpha_j - \alpha_l) \]

is convex and structure-consistent (in the sense of Definition 3) whenever $\phi : \mathbb{R} \to \mathbb{R}_+$ is non-increasing,
differentiable, and satisfies $\phi'(0) < 0$. The papers of Ravikumar et al. [37] and Buffoni et al. contain more examples and a
deeper study of NDCG losses. To extend Corollary 1 to a uniform result, we note that if $G(s_j) > 0$
for all $j$ and $\mathcal{S}$ is compact, then $\varphi$ is 0-coercive, whence Theorem 1 implies that structure consistency
coincides with uniform consistency.
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Another family of loss functions is based on a cascade model of user behavior These 
losses model dependency among items or results by assuming that a user scans an ordered list of 
results from top to bottom and selects the first satisfactory result, where satisfaction is determined 
independently at each position. The form of such expected reciprocal rank (ERR) losses is 

\[ L(\alpha, s) = 1 - \sum_{i=1}^m \frac{1}{F(i)} \, G\bigl(s_{\pi_\alpha^{-1}(i)}\bigr) \prod_{j=1}^{i-1} \Bigl(1 - G\bigl(s_{\pi_\alpha^{-1}(j)}\bigr)\Bigr), \tag{16} \]

where $G : \mathbb{R} \to [0, 1]$ is a non-decreasing function that indicates the prior probability that a result
with score $s_j$ is selected, and $F : \mathbb{N} \to [1, \infty)$ is an increasing function that more heavily weights
the first items. The ERR family also satisfies $L \in [0, 1]$, and empirically correlates well with user
satisfaction in ranking tasks 0]. 
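
As an illustration of the ERR loss (16), the sketch below walks down the list induced by the predicted scores, accumulating the probability that the user is first satisfied at each position. It is only a sketch under stated assumptions: the function name `err_loss`, the default discount $F(i) = i$, and the example gain map $G(r) = (2^r - 1)/16$ for grades $r \in \{0, \ldots, 4\}$ are illustrative choices, not taken from the text.

```python
def err_loss(alpha, s, G, F=lambda i: float(i)):
    """ERR-type loss (16) for predicted scores alpha and relevance scores s.

    G maps a relevance score to a satisfaction probability in [0, 1];
    F is an increasing discount with F(i) >= 1 (default F(i) = i is an assumption).
    """
    m = len(s)
    # order[i - 1] is the item shown at position i (ties broken by index).
    order = sorted(range(m), key=lambda j: -alpha[j])
    not_yet_satisfied = 1.0  # probability that no earlier result satisfied the user
    utility = 0.0
    for position, item in enumerate(order, start=1):
        g = G(s[item])
        utility += g * not_yet_satisfied / F(position)
        not_yet_satisfied *= 1.0 - g
    return 1.0 - utility


# Hypothetical gain map for integer grades 0..4.
example_gain = lambda r: (2.0 ** r - 1.0) / 16.0
```

With two items graded 4 and 0, placing the grade-4 item first gives loss $1/16$, whereas placing it second gives $17/32$.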

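Computing the expected conditional risk $\ell(\alpha, \mu)$ for general $\mu \in \mathcal{M}(\mathcal{S})$ is difficult, but we can
compute it when $\mu$ is a product measure over $s_1, \ldots, s_m$. Indeed, in this case, we have

\[ \ell(\alpha, \mu) = 1 - \sum_{i=1}^m \frac{1}{F(i)} \int G\bigl(s_{\pi_\alpha^{-1}(i)}\bigr) \prod_{j=1}^{i-1} \Bigl(1 - G\bigl(s_{\pi_\alpha^{-1}(j)}\bigr)\Bigr) \, d\mu(s) \]

\[ = 1 - \sum_{i=1}^m \frac{1}{F(i)} \, \mathbb{E}_\mu\bigl[G\bigl(s_{\pi_\alpha^{-1}(i)}\bigr)\bigr] \prod_{j=1}^{i-1} \Bigl(1 - \mathbb{E}_\mu\bigl[G\bigl(s_{\pi_\alpha^{-1}(j)}\bigr)\bigr]\Bigr). \]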

When one believes that the values $G(s_i)$ represent the a priori relevance of the result $i$, this independence
assumption is not unreasonable, and indeed, in Section 5 we provide examples in which
it holds. Regardless, we see that $\ell(\alpha, \mu)$ depends only on the permutation $\pi_\alpha$, and we can compute
the minimizers of the conditional risk for the ERR family (16) using the following lemma, whose
proof we provide in supplementary Appendix A.3.

Lemma 1. Let $p_i = \mathbb{E}_\mu[G(s_i)]$. The permutation $\pi$ minimizing $\ell(\alpha, \mu)$ is in decreasing order of the
$p_i$.
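
Lemma 1 can be checked by brute force on a small instance. The sketch below enumerates all orderings of four items and confirms that the ordering with decreasing $p_i$ minimizes the conditional ERR risk under a product measure. The values in `p` and the discount $F(i) = i$ are arbitrary illustrative assumptions, and the helper name `expected_err_risk` is hypothetical.

```python
from itertools import permutations


def expected_err_risk(order, p, F=lambda i: float(i)):
    """Conditional ERR risk under a product measure.

    order[i - 1] is the item placed at position i and p[j] = E[G(s_j)].
    """
    utility, survive = 0.0, 1.0
    for position, item in enumerate(order, start=1):
        utility += p[item] * survive / F(position)
        survive *= 1.0 - p[item]
    return 1.0 - utility


p = [0.2, 0.7, 0.5, 0.1]  # illustrative satisfaction probabilities
best = min(permutations(range(len(p))), key=lambda order: expected_err_risk(order, p))
decreasing = tuple(sorted(range(len(p)), key=lambda j: -p[j]))
```

Here `best` coincides with `decreasing`, the ordering that sorts items by decreasing $p_i$, as the lemma predicts.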

Lemma 1 shows that an order-preserving property is necessary and sufficient for the consistency of
a surrogate $\varphi$ for the ERR family (16), as it was for the NDCG family (14). To see this, we apply
a variant of Corollary 1 where $A(\mu)$ as defined in Eq. (15) is replaced with the set

\[ A(\mu) = \Bigl\{ \alpha \in \mathbb{R}^m \;\Big|\; \alpha_j > \alpha_l \text{ whenever } \int G(s_j) \, d\mu(s) > \int G(s_l) \, d\mu(s) \Bigr\}. \]

Theorem 5 of [49] implies that $\varphi(\alpha, s) = \sum_{i=1}^m G(s_i) \sum_{j=1}^m \phi(\alpha_i - \alpha_j)$ is a consistent surrogate when
$\phi$ is convex, differentiable, and non-increasing with $\phi'(0) < 0$. Theorem 1 also yields an equivalence
between structure and uniform consistency under suitable conditions on $\mathcal{S}$.

Before concluding this section, we make a final remark, which has bearing on the aggregation
strategies we discuss in Section 5. We have assumed that the structure spaces $\mathcal{S}$ for the NDCG (14)
and ERR (16) loss families consist of real-valued relevance scores. This is certainly not necessary. In
some situations, it may be more beneficial to think of $s \in \mathcal{S}$ as simply an ordered list of the results
or as a directed acyclic graph over $\{1, \ldots, m\}$. We can then apply a transformation $\tau : \mathcal{S} \to \mathbb{R}^m$ to
get relevance scores, using those in the losses (14) and (16). This has the advantage of causing $\mathcal{S}$
to be finite, so Theorem 1 applies and there exists a non-decreasing function $\zeta$ with $\zeta(0) = 0$ such
that for any distribution and any measurable $f$,

\[ R(f) - R^* \le \zeta\bigl(R_\varphi(f) - R_\varphi^*\bigr). \]
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4 Uniform laws and asymptotic consistency 



In Section 3, we gave examples of losses based on readily available pairwise data but for which
Fisher-consistent tractable surrogates do not exist. The existence of Fisher-consistent tractable
surrogates for other forms of data, as in Section 3.3, suggests that aggregation of pairwise and
partial data into more complete data structures, such as lists or scores, makes the problem easier.
However, it is not obvious how to design statistical procedures based on aggregation. In this
section, we formally define a class of suitable estimators that permit us to take advantage of the
weak convergence of Assumption A and show that uniform laws of large numbers hold for our
surrogate losses. This means that we can indeed asymptotically minimize the risk (6) as desired.

Our aim is to develop an empirical analogue of the population surrogate risk (6) that converges
uniformly to the population risk under minimal assumptions on the loss $\varphi$ and structure function
$s$. Given a dataset $\{(Q_i, Y_i)\}_{i=1}^n$ with $(Q_i, Y_i) \in \mathcal{Q} \times \mathcal{Y}$, we begin by defining, for each query $q$, the
batch of data belonging to the query, $B(q) = \{i \in \{1, \ldots, n\} \mid Q_i = q\}$, and the empirical count of
the query, $n_q = |B(q)|$. As a first attempt at developing an empirical objective, we might consider
an empirical surrogate risk based on complete aggregation over the batch of data belonging to each
query:

\[ \frac{1}{n} \sum_q n_q \, \varphi\bigl(f(q), s(\{Y_{i_1}, \ldots, Y_{i_{n_q}} \mid i_j \in B(q)\})\bigr). \tag{17} \]

While we would expect this risk to converge uniformly when $\varphi$ is a sufficiently smooth function
of its structure argument, the analysis of the complete aggregation risk requires overly detailed
knowledge of the surrogate $\varphi$ and the structure function $s$.

To develop a more broadly applicable statistical procedure, we instead consider an empirical surrogate
based on U-statistics. By trading off the nearness of an order-$k$ U-statistic to an i.i.d. sample
and the nearness of the limiting structure distribution $\mu_q$ to a structure $s(Y_1, \ldots, Y_k)$ aggregated
over $k$ draws, we can obtain consistency under mild assumptions on $\varphi$ and $s$. More specifically, for
each query $q$, we consider the surrogate loss

\[ \binom{n_q}{k}^{-1} \sum_{\substack{i_1 < \cdots < i_k \\ i_j \in B(q)}} \varphi\bigl(f(q), s(Y_{i_1}, \ldots, Y_{i_k})\bigr). \tag{18} \]

When $n_q < k$, we adopt the convention $\binom{n_q}{k} = 1$ and the above sum becomes the single term
$\varphi(f(q), s(Y_{i_1}, \ldots, Y_{i_{n_q}}))$ with $\{i_1, \ldots, i_{n_q}\} = B(q)$ as in the expression (17). Hence, our U-statistic
loss recovers the complete aggregation loss (17) when $k = \infty$.

An alternative formulation to loss (18) might consist of $\lceil |B(q)|/k \rceil$ aggregation terms per query,
with each query-preference pair appearing in a single term. However, the instability of such a
strategy is high: a change in the ordering of the data or a substitution of queries could have a large
effect on the final estimator. The U-statistic (18) grants robustness to such perturbations in the
data. Moreover, by choosing the right rate of increase of the aggregation order $k$ as a function of
$n$, we obtain consistent procedures for a broad class of surrogates $\varphi$ and structures $s$.

We associate with the surrogate loss (18) a surrogate empirical risk which weights each query
by its empirical probability of appearance:

\[ \hat{R}_{\varphi,n}(f) := \frac{1}{n} \sum_q n_q \binom{n_q}{k}^{-1} \sum_{\substack{i_1 < \cdots < i_k \\ i_j \in B(q)}} \varphi\bigl(f(q), s(Y_{i_1}, \ldots, Y_{i_k})\bigr). \tag{19} \]
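
To make the U-statistic risk (19) concrete, the following sketch evaluates it exactly by enumerating all size-$k$ subsets of each query's batch. This is a sketch only: the function name and argument layout are assumptions, and exhaustive enumeration is feasible only for tiny batches, since the number of subsets grows combinatorially (Section 6.1 instead samples terms stochastically).

```python
from collections import defaultdict
from itertools import combinations


def u_statistic_risk(f, phi, s, queries, prefs, k):
    """Empirical risk (19), computed by exhaustive enumeration.

    f(q) returns the score vector for query q, phi(a, structure) is the surrogate,
    s(list_of_preferences) aggregates preferences into a structure, and
    queries[i], prefs[i] are the i-th observed query and preference judgment.
    """
    batches = defaultdict(list)  # B(q): indices of the samples belonging to query q
    for i, q in enumerate(queries):
        batches[q].append(i)
    n = len(queries)
    total = 0.0
    for q, batch in batches.items():
        n_q = len(batch)
        # Convention from the text: when n_q < k, aggregate the whole batch in a single term.
        subsets = [tuple(batch)] if n_q < k else list(combinations(batch, k))
        average = sum(phi(f(q), s([prefs[i] for i in idx])) for idx in subsets) / len(subsets)
        total += n_q * average
    return total / n
```

As a toy usage, take scalar "preferences", let `s` be their mean, and let `phi` be squared error against a constant prediction of zero.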
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Let ¥ n denote the probability distribution of the queries given that the dataset consists of a total 
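of $n$ samples. Then by iteration of expectation and Fubini's theorem, the surrogate risk (19) is an
unbiased estimate of the population quantity

\[ R_{\varphi,n}(f) := \sum_q \Bigl[ \sum_{l=1}^n \frac{l}{n} \, \mathbb{P}_n(n_q = l) \, \mathbb{E}\bigl[\varphi(f(Q), s(Y_1, \ldots, Y_{l \wedge k})) \mid Q = q\bigr] \Bigr]. \tag{20} \]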



It remains to establish a uniform law of large numbers guaranteeing the convergence of the
empirical risk (19) to the target population risk (6). Under suitable conditions such as those of
Section 3, this ensures the asymptotic consistency of computationally tractable statistical procedures.
Hereafter, we assume that we have a non-decreasing sequence of function classes $\mathcal{F}_n$, where
any $f \in \mathcal{F}_n$ is a scoring function for queries, mapping $f : \mathcal{Q} \to \mathbb{R}^m$ and giving scores to the (at
most $m$) results for each query $q \in \mathcal{Q}$. Our goal is to give sufficient conditions for the convergence
in probability

\[ \sup_{f \in \mathcal{F}_n} \bigl| \hat{R}_{\varphi,n}(f) - R_\varphi(f) \bigr| \overset{p}{\to} 0 \quad \text{as } n \to \infty. \tag{21} \]



While we do not provide fully general conditions under which the convergence (21) occurs,
we provide representative, checkable conditions sufficient for convergence. At a high level, to
establish (21), we control the uniform difference between the expectations $R_{\varphi,n}(f)$ and $R_\varphi(f)$ and
bound the distance between the empirical risk $\hat{R}_{\varphi,n}$ and its expectation $R_{\varphi,n}$ via covering number
arguments. We now specify assumptions under which our results hold, deferring all proofs to the
supplementary Appendix B.

Without loss of generality, we assume that $p_q$, the true probability of seeing the query $q$, is
non-increasing in the query index $q$. First, we describe the tails of the query distribution:

Assumption D. There exist constants $\beta > 0$ and $K_1 > 0$ such that $p_q \le K_1 q^{-\beta - 1}$ for all $q$, that
is, $p_q = O(q^{-\beta - 1})$.

Infinite sets of queries $\mathcal{Q}$ are reasonable, since search engines, for example, receive a large volume
of entirely new queries each day. Our arguments also apply when $\mathcal{Q}$ is finite, in which case we can
take $\beta \uparrow \infty$.

Our second main assumption concerns the behavior of the surrogate loss $\varphi$ over the function
class $\mathcal{F}_n$, which we assume is contained in a normed space with norm $\|\cdot\|$.

Assumption E (Bounded Lipschitz Losses). The surrogate loss function $\varphi$ is bounded and Lipschitz
continuous over $\mathcal{F}_n$: for any $s \in \mathcal{S}$, any $f, f_1, f_2 \in \mathcal{F}_n$, and any $q \in \mathcal{Q}$, there exist constants $B_n$
and $L_n < \infty$ such that

\[ 0 \le \varphi(f(q), s) \le B_n \]

and

\[ |\varphi(f_1(q), s) - \varphi(f_2(q), s)| \le L_n \|f_1 - f_2\|. \]

This assumption is satisfied whenever $\varphi(\cdot, s)$ is convex and $\mathcal{F}_n$ is compact (and contained in the
interior of the domain of <£>(•, s)) (2^]. Our final assumption gives control over the sizes of the 
function classes $\mathcal{F}_n$ as measured by their covering numbers. (The $\epsilon$-covering number of $\mathcal{F}$ is the
smallest $N$ such that there are $f^i$, $i \le N$, such that $\min_i \|f^i - f\| \le \epsilon$ for any $f \in \mathcal{F}$.)

Assumption F. For all $\epsilon > 0$, $\mathcal{F}_n$ has $\epsilon$-covering number $N(\epsilon, n) < \infty$.
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With these assumptions in place, we give a few representative conditions that enable us to 
guarantee uniform convergence (21). Roughly, these conditions control the interaction between the
size of the function classes $\mathcal{F}_n$ and the order $k$ of aggregation used with $n$ samples. To that end, we
let the aggregation order $k_n$ grow with $n$. In stating the conditions, we make use of the shorthand
$\mathbb{E}_q[\varphi(f(q), s(Y_{1:k}))]$ for $\mathbb{E}[\varphi(f(Q), s(Y_1, \ldots, Y_k)) \mid Q = q]$.

Condition I. There exist a $\rho > 0$ and a constant $C$ such that for all $q \in \mathcal{Q}$, $n \in \mathbb{N}$, $k \in \mathbb{N}$, and
$f \in \mathcal{F}_n$,

\[ \Bigl| \mathbb{E}_q[\varphi(f(q), s(Y_1, \ldots, Y_k))] - \lim_{k'} \mathbb{E}_q[\varphi(f(q), s(Y_1, \ldots, Y_{k'}))] \Bigr| \le C B_n k^{-\rho}. \]

Additionally, the sequences $B_n$ and $k_n$ satisfy $B_n = o(k_n^\rho)$.

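This condition is not unreasonable; when $\varphi$ and $s$ are suitably continuous, we expect $\rho \ge \frac{1}{2}$. We
also consider an alternative covering number condition.

Condition I'. The sequences $\epsilon_n$ and $k_n$ and an $\epsilon_n$-cover $\mathcal{F}_n^1, \ldots, \mathcal{F}_n^{N(\epsilon_n, n)}$ of $\mathcal{F}_n$ can be chosen such
that

\[ \max_{i \in [N(\epsilon_n, n)]} \inf_{f \in \mathcal{F}_n^i} \Bigl| \sum_q p_q \, \mathbb{E}_q[\varphi(f(q), s(Y_1, \ldots, Y_{k_n}))] - R_\varphi(f) \Bigr| \to 0. \]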



Condition I' is weaker than Condition I since it does not require uniform convergence over
$q \in \mathcal{Q}$. If the function class $\mathcal{F}$ is fixed for all $n$, then the weak convergence of $s(Y_1, \ldots, Y_k)$ as in
Assumption A guarantees Condition I', since $N(\epsilon, n) = N(\epsilon, n') < \infty$ and we may take $\epsilon$ arbitrarily
small. We require one additional condition, which relates the growth of $k_n$, $B_n$, and the function
classes $\mathcal{F}_n$ more directly.



Condition II. The sequences $k_n$ and $B_n$ satisfy $k_n B_n^{(1+\beta)/\beta} = o(n)$. Additionally, for any fixed
$\epsilon > 0$, the sequences satisfy

\[ k_n B_n \sqrt{\log N\Bigl(\frac{\epsilon}{4 L_n}, n\Bigr)} = o(\sqrt{n}). \]



By inspection, Condition II is satisfied for any $k_n = o(\sqrt{n})$ if the function classes $\mathcal{F}_n$ are fixed
for all $n$. Similarly, if for all $k \ge k_0$, $s(Y_1, \ldots, Y_k) = s(Y_1, \ldots, Y_{k_0})$, so $s$ depends only on its
first $k_0$ arguments, Condition II holds whenever $\max\{B_n^{(1+\beta)/\beta}, B_n^2 \log N(\epsilon/4L_n, n)\} = o(n)$. If
the function classes $\mathcal{F}_n$ consist of linear functionals represented by vectors $\theta \in \mathbb{R}^{d_n}$ in a ball of
some finite radius, then $\log N(\epsilon, n) \approx d_n \log \epsilon^{-1}$, which means that Condition II roughly requires
$k_n \sqrt{d_n / n} \to 0$ as $n \to \infty$. Modulo the factor $k_n$, this condition is familiar for its necessity for
convergence of parametric statistical problems.

With these conditions in place, we come to our main result on the convergence of our U-statistic-based
empirical loss minimization procedures.

Theorem 4. Assume Condition I or I' and additionally assume the growth Condition II. Under
Assumptions D, E, and F, we obtain

\[ \sup_{f \in \mathcal{F}_n} \bigl| \hat{R}_{\varphi,n}(f) - R_\varphi(f) \bigr| \overset{p}{\to} 0. \]
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We remark in passing that if Condition |TT] holds, except that the o(y / n) bound is replaced by 0(n~ p ) 
for some $\rho < \frac{1}{2}$, the conclusion of Theorem 4 can be strengthened to both convergence almost surely
and in expectation.

By inspection, Theorem 4 provides our desired convergence guarantee (21). By combining the
Fisher-consistent loss families outlined in Section 3.3 with the consistency guarantees provided
by Theorem 4, it is thus possible to design statistical procedures that are both computationally
tractable (minimizing only convex risks) and asymptotically consistent.



5 Rank aggregation strategies 

In this section, we give several examples of practical strategies for aggregating disparate user 
preferences under our framework. Motivated by the statistical advantages of complete preference 
data highlighted in Section 3.3, we first present strategies for constructing complete vectors of
relevance scores from pairwise preference data. We then discuss a model for the selection or 
"click" data that arises in web search and information retrieval and show that maximum likelihood 
estimation under this model allows for consistent ranking. We conclude this section with a brief 
overview of structured aggregation strategies. 



5.1 Recovering scores from pairwise preferences 

Here we treat partial preference observations as noisy evidence of an underlying complete ranking
and attempt to achieve consistency with respect to a complete preference data loss. We consider
three methods that take as input pairwise preferences and output a relevance score vector
$s \in \mathbb{R}^m$. Such procedures fit naturally into our ranking-with-aggregation framework: the results
in Section 3.3 and Section 4 show that a Fisher-consistent loss is consistent for the limiting distribution
of the scores $s$ produced by the aggregation procedure. Thus, it is the responsibility of the
statistician (the designer of an aggregation procedure) to determine whether the scores accurately
reflect the judgments of the population. We present our first example in some detail to show how
aggregation of pairwise judgments can lead to consistency in our framework, following with brief
descriptions of alternate aggregation strategies. For an introduction to the design of aggregation
strategies for pairwise data, see Tsukida and Gupta [47] as well as the book by David [15].



Thurstone-Mosteller least squares and skew-symmetric scoring. The first aggregation
strategy constructs a relevance score vector $s$ in two phases. First, it aggregates a sequence of
observed preference judgments $Y_i \in \mathcal{Y}$, provided in any form, into a skew-symmetric matrix
$A \in \mathbb{R}^{m \times m}$ satisfying $A = -A^T$. Each entry $A_{ij}$ encodes the extent to which item $i$ is preferred to item
$j$. Given such a skew-symmetric matrix, Thurstone and Mosteller [34] recommend deriving a score
vector $s$ such that $s_i - s_j \approx A_{ij}$. In practice, one may not observe preference information for every
pair of results, so we define a masking matrix $\Omega \in \{0, 1\}^{m \times m}$ with $\Omega = \Omega^T$, $\Omega_{ii} = 1$, and $\Omega_{ij} = 1$
if and only if preference information has been observed for the pair $i \ne j$. Letting $\circ$ denote the
Hadamard product, a natural objective for selecting scores [e.g., 23] is the least squares objective

\[ \underset{x \,:\, x^T \mathbf{1} = 0}{\text{minimize}} \quad \frac{1}{4} \sum_{i,j} \Omega_{ij} \bigl(A_{ij} - x_i + x_j\bigr)^2 = \frac{1}{4} \bigl\| \Omega \circ \bigl(A - (x \mathbf{1}^T - \mathbf{1} x^T)\bigr) \bigr\|_{\mathrm{Fr}}^2. \tag{22} \]
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The gradient of the objective (|22p is 

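\[ D_\Omega x - (\Omega \circ A) \mathbf{1} - \Omega x, \quad \text{where } D_\Omega := \operatorname{diag}(\Omega \mathbf{1}). \]

Setting $s = (D_\Omega - \Omega)^\dagger (\Omega \circ A) \mathbf{1}$ yields the solution to the minimization problem (22), since $D_\Omega - \Omega$
is an unnormalized graph Laplacian matrix [10], and therefore $\mathbf{1}^T s = \mathbf{1}^T (D_\Omega - \Omega)^\dagger (\Omega \circ A) \mathbf{1} = 0$.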
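
The closed-form least squares scores can be computed with a pseudoinverse. The following is a minimal NumPy sketch of that formula; the function name `least_squares_scores` is an assumption, and forming a dense pseudoinverse is a choice made for clarity rather than efficiency (a sparse Laplacian solver would be preferable for large $m$).

```python
import numpy as np


def least_squares_scores(A, Omega):
    """Scores s = (D_Omega - Omega)^+ (Omega o A) 1 for skew-symmetric A and symmetric 0/1 mask Omega."""
    A = np.asarray(A, dtype=float)
    Omega = np.asarray(Omega, dtype=float)
    ones = np.ones(A.shape[0])
    # Unnormalized graph Laplacian of the observation graph; the diagonal of Omega cancels out.
    laplacian = np.diag(Omega @ ones) - Omega
    # Omega * A is the Hadamard (entrywise) product.
    return np.linalg.pinv(laplacian) @ ((Omega * A) @ ones)
```

When $A_{ij} = s_i - s_j$ exactly for a centered score vector $s$ and the observation graph is connected, the procedure recovers $s$, even if some pairs are unobserved.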

If $\Omega = \mathbf{1}\mathbf{1}^T$, so that all pairwise preferences are observed, then the eigenvalue decomposition of
$D_\Omega - \Omega = mI - \mathbf{1}\mathbf{1}^T$ can be computed explicitly as $V \Sigma V^T$, where $V$ is any orthonormal matrix
whose first column is $\mathbf{1}/\sqrt{m}$, and $\Sigma$ is a diagonal matrix with entries $0$ (once) and $m$ repeated $m - 1$
times. Thus, letting $x_A$ and $x_B$ denote solutions to the minimization problem (22) with different
skew-symmetric matrices $A$ and $B$ and noting that $A\mathbf{1} \perp \mathbf{1}$ since $\mathbf{1}^T A \mathbf{1} = 0$, we have the Lipschitz
continuity of the solutions $s$ in $A$:

\[ \|x_A - x_B\|_2^2 = \bigl\| (mI - \mathbf{1}\mathbf{1}^T)^\dagger (A - B) \mathbf{1} \bigr\|_2^2 = \frac{1}{m^2} \|(A - B)\mathbf{1}\|_2^2 \le \frac{1}{m} \|A - B\|_{\mathrm{Fr}}^2. \]

Similarly, when $\Omega$ is fixed, the score structure $s$ is likewise Lipschitz in $A$ for any norm $|||\cdot|||$ on
skew-symmetric matrices.

A variety of procedures are available for aggregating pairwise comparison data $Y_i \in \mathcal{Y}$ into a
skew-symmetric matrix A. One example, the Bradley- Terry-Luce (BTL) model @], is based upon 
empirical log-odds ratios. Specifically, assume that $Y_i \in \mathcal{Y}$ are pairwise comparisons of the form
$j \succ l$, meaning item $j$ is preferred to item $l$. Then we can set

\[ A_{jl} = \log \frac{\hat{P}(j \succ l) + c}{\hat{P}(j \prec l) + c} \quad \text{for observed pairs } j, l, \]

where $\hat{P}$ denotes the empirical distribution over $\{Y_1, \ldots, Y_k\}$ and $c > 0$ is a smoothing parameter.
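
The smoothed log-odds construction can be sketched as follows. This is an illustration under assumptions: comparisons are represented as `(winner, loser)` index pairs, the function name `btl_log_odds_matrix` is hypothetical, and the function also returns the mask $\Omega$ of observed pairs (with ones on the diagonal) for use in the least squares objective (22).

```python
import math
from collections import Counter


def btl_log_odds_matrix(comparisons, m, c=1.0):
    """Build skew-symmetric A with A[j][l] = log((P(j > l) + c) / (P(l > j) + c)) on observed pairs.

    comparisons is a list of (winner, loser) pairs; P is the empirical distribution over that list.
    Returns (A, Omega), where Omega[j][l] = 1 if the pair was observed or j == l.
    """
    wins = Counter(comparisons)  # wins[(j, l)]: number of times j was preferred to l
    total = len(comparisons)
    A = [[0.0] * m for _ in range(m)]
    Omega = [[1 if j == l else 0 for l in range(m)] for j in range(m)]
    for j in range(m):
        for l in range(m):
            if j != l and wins[(j, l)] + wins[(l, j)] > 0:
                p_jl = wins[(j, l)] / total
                p_lj = wins[(l, j)] / total
                A[j][l] = math.log((p_jl + c) / (p_lj + c))
                Omega[j][l] = 1
    return A, Omega
```

Unobserved pairs are left at zero in `A` and masked out by `Omega`, so they do not influence the least squares scores.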

Since the proposed structure $s$ is a continuous function of the skew-symmetric matrix $A$, the
limiting distribution $\mu$ is a point mass whenever $A$ converges almost surely, as it does in the BTL
model. If aggregation is carried out using only a finite number of preferences rather than letting $k$
approach $\infty$ with $n$, then $\mu$ converges to a non-degenerate distribution. Theorem 1 grants uniform
consistency since the score space $\mathcal{S}$ is finite.

Borda count and budgeted aggregation The Borda count [3] provides a computationally 
efficient method for computing scores from election results. In a general election setting, the 
procedure counts the number of times that a particular item was rated as the best, second best, 
and so on. Given a skew-symmetric matrix A representing the outcomes of elections, the Borda 
count assigns the scores $s = A\mathbf{1}$. As above, a skew-symmetric matrix $A$ can be constructed from
input preferences $\{Y_1, \ldots, Y_k\}$, and the choice of this first-level aggregation can greatly affect the
resulting rankings. Ammar and Shah [1] suggest that if one has limited computational budget and
only pairwise preference information then one should assign to item $j$ the score

\[ \frac{1}{m - 1} \sum_{l \ne j} \hat{P}(j \succ l), \]

which estimates the probability of winning an election against an opponent chosen uniformly.
This is equivalent to the Borda count when we choose $A_{jl} = \hat{P}(j \succ l) - \hat{P}(j \prec l)$ as the entries in
the skew-symmetric aggregate $A$.
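
A Borda-style score of the form $s = A\mathbf{1}$, with $A_{jl} = \hat{P}(j \succ l) - \hat{P}(j \prec l)$, reduces to a single pass over the comparisons, since row $j$ of $A$ sums to the fraction of comparisons won by $j$ minus the fraction lost. The sketch below assumes comparisons are stored as `(winner, loser)` index pairs; the function name is hypothetical.

```python
def borda_scores(comparisons, m):
    """Scores s = A 1 for A[j][l] = P(j > l) - P(l > j), with P the empirical distribution."""
    total = len(comparisons)
    scores = [0.0] * m
    for winner, loser in comparisons:
        # Each comparison adds 1/total to the winner's row sum and subtracts it from the loser's.
        scores[winner] += 1.0 / total
        scores[loser] -= 1.0 / total
    return scores
```

The cost is linear in the number of comparisons, which reflects the computational appeal of this aggregation strategy.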
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Principal eigenvector method Saaty [39|] describes the principal eigenvector method, which 
begins by forming a reciprocal matrix $A \in \mathbb{R}^{m \times m}$, with positive entries $A_{ij} = (A_{ji})^{-1}$, from pairwise
comparison judgments. Here $A_{ij}$ encodes a multiplicative preference for item $i$ over item $j$; the idea
is that ratios preserve preference strength [3(3]. To generate A, one may use, for example, smoothed 



empirical ratios $A_{jl} = \frac{\hat{P}(j \succ l) + c}{\hat{P}(j \prec l) + c}$. Saaty recommends finding a vector $s$ so that $s_i / s_j \approx A_{ij}$, suggesting
using the Perron vector of the matrix, that is, the first eigenvector of $A$.
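
For a matrix with strictly positive entries, the Perron vector can be approximated by power iteration. The sketch below is one simple way to do this, not a prescribed method: the function name, the normalization of scores to sum to one, and the stopping rule are all assumptions.

```python
import numpy as np


def perron_scores(A, iterations=1000, tol=1e-12):
    """Approximate the principal (Perron) eigenvector of an entrywise positive matrix A."""
    A = np.asarray(A, dtype=float)
    s = np.ones(A.shape[0]) / A.shape[0]  # positive starting vector
    for _ in range(iterations):
        nxt = A @ s
        nxt /= nxt.sum()  # keep the iterate positive and summing to one
        if np.abs(nxt - s).max() < tol:
            return nxt
        s = nxt
    return s
```

If the reciprocal matrix is perfectly consistent, meaning $A_{ij} = w_i / w_j$ for some positive vector $w$, the returned scores are proportional to $w$.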

5.2 Cascade models for selection data 

Cascade models [l~4l . 0] explain the behavior of a user presented with an ordered list of items, for 
example from a web search. In a cascade model, a user considers results in the presented order 
and selects the first to satisfy him or her, and the model assumes that result $l$ satisfies a user with
probability $p_l$, independently of previous items in the list. It is natural to express a variety of
ranking losses, including the expected reciprocal rank (ERR) family (16), as expected disutility
under a cascade model, but computation and optimization of these losses require knowledge of the
satisfaction probabilities $p_l$. When the satisfaction probabilities are unknown, Chapelle et al.
recommend plugging in those values $p_l$ that maximize the likelihood of observed click data. Here we
show that risk consistency for the ERR family is simple to characterize when scores are estimated 
via maximum likelihood. 

To this end, fix a query $q$ and let each affiliated preference judgment $Y_i$ consist of a triple
$(m_i, \pi_i, c_i)$, where $m_i$ is the number of results presented to the user, $\pi_i$ is the order of the presented
results, which maps positions $\{1, \ldots, m_i\}$ to the full result set $\{1, \ldots, m\}$, and $c_i \in \{1, \ldots, m_i + 1\}$
is the position clicked on by the user ($m_i + 1$ if the user chooses nothing). The likelihood $g$ of an
i.i.d. sequence $\{Y_1, \ldots, Y_k\}$ under a cascade model $p$ is

\[ g(p, \{Y_1, \ldots, Y_k\}) = \prod_{i=1}^k p_{\pi_i(c_i)}^{1(c_i \le m_i)} \prod_{j=1}^{c_i - 1} \bigl(1 - p_{\pi_i(j)}\bigr), \]

and the maximum likelihood estimator of the satisfaction probabilities has the closed form

\[ \hat{p}_l(Y_1, \ldots, Y_k) = \frac{\sum_{i=1}^k 1(\pi_i(c_i) = l)}{\sum_{i=1}^k \sum_{j=1}^{c_i} 1(\pi_i(j) = l)}. \]
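
The maximum likelihood estimate under the cascade model is the ratio of clicks on a result to the number of times that result was examined, where a result counts as examined if it sits at or above the clicked position. The sketch below assumes zero-based item identifiers, represents each $\pi_i$ as a Python list whose entry `pi_i[j - 1]` is the item at position $j$, and returns `None` for items that were never examined; these conventions and the function name are illustrative assumptions.

```python
def cascade_mle(observations, m):
    """Click-through MLE of satisfaction probabilities under a cascade model.

    observations is a list of triples (m_i, pi_i, c_i): m_i results were shown,
    pi_i[j - 1] is the item at position j, and c_i in {1, ..., m_i + 1} is the clicked
    position (m_i + 1 means no click).
    """
    clicks = [0] * m
    examined = [0] * m
    for m_i, pi_i, c_i in observations:
        # The user examines positions 1..c_i, or all m_i positions if nothing was clicked.
        last_examined = min(c_i, m_i)
        for position in range(1, last_examined + 1):
            examined[pi_i[position - 1]] += 1
        if c_i <= m_i:
            clicks[pi_i[c_i - 1]] += 1
    return [clicks[l] / examined[l] if examined[l] else None for l in range(m)]
```

For example, an item that is shown first in three sessions and clicked in one of them receives the estimate $1/3$.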

To incorporate this maximum likelihood aggregation procedure into our framework, we define
the structure function $s$ to be the vector

\[ s(Y_1, \ldots, Y_k) := \hat{p}(Y_1, \ldots, Y_k) \in \mathbb{R}^m \]

of maximum likelihood probabilities, and we take as our loss $L$ any member of the ERR family (16).
The strong law of large numbers implies the a.s. convergence of $\hat{p}$ to a vector $p \in [0, 1]^m$, so that
the limiting law satisfies $\mu_q(\{p\}) = 1$. Since $\mu_q$ is a product measure over $[0, 1]^m$, Lemma 1 implies that
any $\alpha$ inducing the same ordering over results as $p$ minimizes the conditional ERR risk $\ell(\alpha, \mu)$. By
application of Theorem 1 (or Proposition 2) and Theorem 4, it is possible to asymptotically minimize the
Expected Reciprocal Rank by aggregation.

5.3 Structured aggregation 



Our framework can leverage aggregation procedures [see, e.g., 21] that map input preferences into
representations of combinatorial objects. Consider the setting of Section 3.2, in which each observed
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preference judgment Y is the weighted adjacency matrix of a directed acyclic graph, our loss of 
interest $L$ is the edgewise indicator loss (10), and our candidate surrogate losses have the form (18).
Theorems 2 and 3 establish that risk consistency is not generally attainable when $s(Y_1, \ldots, Y_k) = Y_1$.
In certain cases, aggregation can recover consistency. Indeed, define

\[ s(Y_1, \ldots, Y_k) := \frac{1}{k} \sum_{i=1}^k Y_i, \]

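the average of the input adjacency matrices. For an i.i.d. sequence $Y_1, Y_2, \ldots$ associated with a
given query $q$, we have $s(Y_1, \ldots, Y_n) \overset{a.s.}{\to} \mathbb{E}(Y \mid Q = q)$ by the strong law of large numbers, and
hence the asymptotic surrogate risk

\[ R_\varphi(f) = \sum_q p_q \int \varphi(f(q), s) \, d\mu_q(s) = \sum_q p_q \, \varphi\bigl(f(q), \mathbb{E}(Y \mid Q = q)\bigr). \]

Recalling the conditional pairwise risk (11), we can rewrite the risk as

\[ R(f) = \sum_q p_q \Bigl[ \sum_{i<j} Y^{\mu_q}_{ij} \, 1\bigl(f_i(q) \le f_j(q)\bigr) + \sum_{i>j} Y^{\mu_q}_{ij} \, 1\bigl(f_i(q) < f_j(q)\bigr) \Bigr] \]

\[ = \sum_q p_q \sum_{i>j} \mathbb{E}[Y_{ij} \mid Q = q] + \sum_q p_q \sum_{i<j} \mathbb{E}[Y_{ij} - Y_{ji} \mid Q = q] \, 1\bigl(f_i(q) \le f_j(q)\bigr). \]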

The discussion immediately following Proposition 2 shows that any consistent surrogate $\varphi$ must
be bounded away from its minimum for $\alpha \notin \operatorname{argmin}_{\alpha'} \ell(\alpha', \mu)$. Since the limiting distribution $\mu$ is
a point mass at some adjacency matrix $s$ for each $q$, a surrogate loss $\varphi$ is consistent if and only if

\[ \inf_{\alpha} \Bigl\{ \varphi(\alpha, s) - \inf_{\alpha'} \varphi(\alpha', s) \;\Big|\; \alpha \notin \operatorname*{argmin}_{\alpha'} L(\alpha', s) \Bigr\} > 0. \]

In the important special case when the difference graph associated with $\mathbb{E}[Y \mid Q = q]$ is a
DAG for each query $q$ (recall Section 3.2.2), structure consistency is obtained if for each
$\alpha^* \in \operatorname{argmin}_\alpha \varphi(\alpha, s)$, $\operatorname{sign}(\alpha^*_i - \alpha^*_j) = \operatorname{sign}(s_{ij} - s_{ji})$ for each pair of results $i, j$. As an example, in this
setting

\[ \varphi(\alpha, s) := \sum_{i,j} [s_{ij} - s_{ji}]_+ \, \phi(\alpha_i - \alpha_j) \tag{23} \]

is consistent when $\phi$ is non-increasing, convex, and has derivative $\phi'(0) < 0$.
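
To illustrate the surrogate (23), the sketch below evaluates it with the logistic function $\phi(z) = \log(1 + e^{-z})$, which is convex and non-increasing with $\phi'(0) = -1/2$. The choice of $\phi$, the function name, and the example matrix are assumptions made for illustration only.

```python
import math


def difference_graph_loss(alpha, s, phi=lambda z: math.log1p(math.exp(-z))):
    """Surrogate of the form (23): sum over ordered pairs of [s_ij - s_ji]_+ * phi(alpha_i - alpha_j)."""
    m = len(alpha)
    return sum(
        max(s[i][j] - s[j][i], 0.0) * phi(alpha[i] - alpha[j])
        for i in range(m)
        for j in range(m)
        if i != j
    )


# Averaged adjacency matrix whose difference graph is the DAG 0 -> 1 -> 2 with the extra edge 0 -> 2.
s_example = [[0.0, 2.0, 1.0],
             [1.0, 0.0, 3.0],
             [0.0, 0.0, 0.0]]
```

Scores that respect the difference graph, such as `[2.0, 1.0, 0.0]`, incur a smaller loss than scores that reverse it, since only the edge with positive net weight in each pair is penalized.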

The Fisher-consistent loss (23) is similar to the inconsistent losses (12) considered in Section 3.2,
but the coefficients adjoining each $\phi(\alpha_i - \alpha_j)$ summand exhibit a key difference. While the inconsistent
losses employ coefficients based solely on the average $i \to j$ weight $s_{ij}$, the consistent loss
coefficients are nonlinear functions of the edge weight differences $s_{ij} - s_{ji}$: they are precisely the
edge weights of the difference graph introduced in Section 3.2.2. Since at least one of the two
coefficients $[s_{ij} - s_{ji}]_+$ and $[s_{ji} - s_{ij}]_+$ is always zero, the loss (23) penalizes misordering either edge
$i \to j$ or $j \to i$. This contrasts with the inconsistent surrogates of Section 3.2, which simultaneously
associate non-zero convex losses with opposing edges $i \to j$ and $j \to i$. Note also that our argument
for the consistency of the loss (23) does not require Definition 4's low-noise assumption: consistency
holds under the weaker condition that, on average, a population's preferences are acyclic.
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6 Experimental Study and Implementation 



In this section, we describe strategies for solving the convex programs that emerge from our ag- 
gregation approach to ranking and demonstrate the empirical utility of our proposed procedures. 
We begin with a broad description of implementation strategies, which we follow with our specific 
experiments. 



6.1 Minimizing the empirical risk 



At first glance, the empirical risk (19) appears difficult to minimize, since the number of terms
grows exponentially in the level of aggregation $k$. Fortunately, we may leverage techniques from
the stochastic optimization literature [35, 18] to minimize the risk (19) in time linear in $k$ and
independent of $n$. Let us consider minimizing a function of the form

\[ R_N(f) := \frac{1}{N} \sum_{i=1}^N \varphi(f, s^i) + \Phi(f), \tag{24} \]



where $\{s^i\}_{i=1}^N$ is some collection of data, $\varphi(\cdot, s)$ is convex in its first argument, and $\Phi$ is a convex
regularizing function (possibly zero).

Duchi and Singer [18], using ideas similar to those of Nemirovski et al. [35], develop a specialized
stochastic gradient descent method for minimizing composite objectives of the form (24) (providing
a unified and sharper analysis in the subsequent paper [20]). Such methods maintain a parameter
$f^t$, which is assumed to live in a convex subset $\mathcal{F}$ of a Hilbert space with inner product $\langle \cdot, \cdot \rangle$, and
iteratively update $f^t$ as follows. At iteration $t$, an index $i_t \in [N]$ is chosen uniformly at random
and the gradient $\nabla_f \varphi(f^t, s^{i_t})$ is computed at $f^t$. The parameter $f$ is then updated via

\[ f^{t+1} = \operatorname*{argmin}_{f \in \mathcal{F}} \Bigl\{ \bigl\langle f, \nabla_f \varphi(f^t, s^{i_t}) \bigr\rangle + \Phi(f) + \frac{1}{2 \eta_t} \|f - f^t\|^2 \Bigr\}, \tag{25} \]

where $\eta_t > 0$ is an iteration-dependent stepsize and $\|\cdot\|$ denotes the Hilbert norm. The convergence
guarantees of the update (25) are well understood. Define $\hat{f}_T = (1/T) \sum_{t=1}^T f^t$ to be
the average parameter after $T$ iterations. If the function $R_N$ is strongly convex (meaning it has at
least quadratic curvature), the step-size choice $\eta_t \propto 1/t$ gives

\[ \mathbb{E}\bigl[R_N(\hat{f}_T)\bigr] - \inf_{f \in \mathcal{F}} R_N(f) = O(1) \cdot \frac{1}{T}, \]

where the expectation is taken with respect to the indices $i_t$ chosen during each iteration of the
algorithm. In the convex case (without assuming any stronger properties than convexity), the
step-size choice $\eta_t \propto 1/\sqrt{t}$ yields

\[ \mathbb{E}\bigl[R_N(\hat{f}_T)\bigr] - \inf_{f \in \mathcal{F}} R_N(f) = O(1) \cdot \frac{1}{\sqrt{T}}. \]

These guarantees also hold with high probability [20].

Neither of the convergence guarantees $1/T$ or $1/\sqrt{T}$ depends on the number $N$ of terms in
the stochastic objective (24). As a consequence, we can apply the composite stochastic gradient
method (25) directly to the empirical risk (19): we sample a query $q$ with probability $n_q/n$, after
which we uniformly sample one of the $\binom{n_q}{k}$ collections $\{i_1, \ldots, i_k\}$ of $k$ indices associated with query
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q, and we then perform the gradient update ([25]) using the gradient sample V </?(/*, s(Yi 1 , . . . , Yi k )). 
This stochastic gradient scheme means that we can minimize the empirical risk in a number of
iterations independent of both $n$ and $k$; the run-time behavior of the method scales independently
of $n$ and depends on $k$ only so much as computing an instantaneous gradient $\nabla \varphi(f, s(Y_1, \ldots, Y_k))$
increases with $k$.
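
The sampling scheme described above can be sketched in a few lines. Both functions below are illustrative assumptions rather than the authors' code: `sample_gradient_indices` draws one term of the empirical risk (19), and `composite_step` carries out the update (25) in the special case where $\mathcal{F} = \mathbb{R}^d$ and $\Phi(f) = (\lambda/2)\|f\|^2$, for which the minimizer has the closed form $(f^t - \eta_t g)/(1 + \eta_t \lambda)$.

```python
import random


def sample_gradient_indices(batches, n, k, rng=random):
    """Sample a query q with probability n_q / n, then a uniform size-k subset of its batch.

    batches maps each query to the list of sample indices B(q); n is the total sample count.
    When a batch has fewer than k elements, the whole batch is used (the convention for (18)).
    """
    queries = list(batches)
    weights = [len(batches[q]) / n for q in queries]
    q = rng.choices(queries, weights=weights, k=1)[0]
    batch = batches[q]
    indices = list(batch) if len(batch) < k else rng.sample(batch, k)
    return q, sorted(indices)


def composite_step(theta, grad, eta, lam):
    """Update (25) for Phi(f) = (lam / 2) * ||f||^2 on all of R^d (an assumed special case)."""
    return [(t - eta * g) / (1.0 + eta * lam) for t, g in zip(theta, grad)]
```

Each iteration touches only $k$ preference judgments, so its cost does not depend on $n$ or on the number of terms in the sum (19).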




Figure 2. Timing experiments for different values of $k$ and $n$ when applying the method (25). The
horizontal axes are the number of stochastic gradient iterations, the vertical axes are the estimated
optimality gap for the empirical surrogate risk. Left: varying amount of aggregation $k$, fixed
$n = 4 \cdot 10^5$. Right: varying total number of samples $n$, fixed $k = 10^2$.



In Figure 2, we show empirical evidence that the stochastic method (25) works as described. In
particular, we minimize the empirical U-statistic-based risk (19) with the loss (28) we employ in
our experiments in the next section. In each plot in Figure 2, we give an estimated optimality gap,
$\hat{R}_{\varphi,n}(f^t) - \inf_{f \in \mathcal{F}} \hat{R}_{\varphi,n}(f)$, as a function of $t$, the number of iterations. As in the section to follow,
$\mathcal{F}$ consists of linear functionals parameterized by a vector $\theta \in \mathbb{R}^d$ with $d = 136$. To estimate
$\inf_{f \in \mathcal{F}} \hat{R}_{\varphi,n}(f)$, we perform 100,000 updates of the procedure (25), then evaluate the output
predictor $\hat{f}$ on an additional (independent) 50,000 samples (the number of terms in the
true objective is too large to evaluate). To estimate the risk $\hat{R}_{\varphi,n}(f^t)$, we use a moving average of
the previous 100 sampled losses $\varphi(f^\tau, s^{i_\tau})$ for $\tau \in \{t - 99, \ldots, t\}$, which is an unbiased estimate of
an upper bound on the empirical risk $\hat{R}_{\varphi,n}(f^t)$ (see, e.g., [8]). We perform the experiment 20 times
and plot averages as well as 90% confidence intervals. As predicted by our theoretical results, the
number of iterations to attain a particular accuracy is essentially independent of $n$ and $k$; all the
plots lie on one another.



6.2 Experimental evaluation 

To perform our experimental evaluation, we use a subset of the Microsoft Learning to Rank Web10K
dataset [36], which consists of 10,000 web searches (queries) issued to the Microsoft Bing search
engine, a set of approximately 100 potential results for each query, and a relevance score $r \in \mathbb{R}$
associated with each query/result pair. A query/result pair is represented by a $d = 136$-dimensional
feature vector of standard document-retrieval features.
To understand the benefits of aggregation and consistency in the presence of partial preference
data, we generate pairwise data from the observed query/result pairs, so that we know the true
asymptotic generating distribution. We adopt a loss $L$ from the NDCG family (14) and compare
three surrogate losses: a consistent regression surrogate based on aggregation, an inconsistent but
commonly used pairwise logistic loss [17], and a consistent loss that requires access to complete
preference data [37]. Recalling the NDCG score (14) of a prediction vector $\alpha \in \mathbb{R}^m$ for scores
$s \in \mathbb{R}^m$ (where $\pi_\alpha$ is the permutation induced by $\alpha$), we have the loss

$$L(\alpha, s) = 1 - \frac{1}{Z(s)} \sum_{j=1}^m \frac{G(s_j)}{F(\pi_\alpha(j))},$$

where $Z(s)$ is the normalizing value for the NDCG score, and $F(\cdot)$ and $G(\cdot)$ are increasing functions.

Given a set of queries $q$ and relevance scores $r_i \in \mathbb{R}$, we generate $n$ pairwise preference
observations according to a Bradley-Terry-Luce (BTL) model 0]. That is, for each observation, we
choose a query $q$ uniformly at random and then select a uniformly random pair of results to
compare. The pair is ordered as $i \succ j$ (item $i$ is preferred to $j$) with probability $p_{ij}$, and $j \succ i$ with
probability $1 - p_{ij} = p_{ji}$, where

$$p_{ij} = \frac{1}{1 + \exp(r_j - r_i)} \tag{26}$$

for $r_i$ and $r_j$ the respective relevances of results $i$ and $j$ under query $q$.
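The data-generation step can be sketched as follows. This is an illustrative sketch rather than the experimental code: it assumes the BTL probability $p_{ij} = 1/(1 + \exp(r_j - r_i))$, and the function names and the `(query, winner, loser)` triple format are hypothetical choices.

```python
import numpy as np

def btl_prob(r_i, r_j):
    """Probability that item i is preferred to item j under the BTL model."""
    return 1.0 / (1.0 + np.exp(r_j - r_i))

def sample_btl_pairs(relevances, n, seed=0):
    """Draw n pairwise preferences.

    relevances: list of 1-D arrays, one array of relevance scores per query.
    Returns a list of (query, winner, loser) index triples.
    """
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n):
        q = int(rng.integers(len(relevances)))           # uniform query
        r = relevances[q]
        i, j = rng.choice(len(r), size=2, replace=False)  # uniform pair of results
        if rng.random() < btl_prob(r[i], r[j]):
            out.append((q, int(i), int(j)))               # i preferred to j
        else:
            out.append((q, int(j), int(i)))               # j preferred to i
    return out
```

Note that `btl_prob(a, b) + btl_prob(b, a)` equals 1, matching $1 - p_{ij} = p_{ji}$.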

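We define our structure functions $s_k$ as score vectors in $\mathbb{R}^m$, where given a set of $k$ preference
pairs, the score for item $i$ is

$$s_k(i) = \frac{1}{m - 1} \sum_{j \neq i} \log \frac{\widehat{P}(j \prec i)}{\widehat{P}(j \succ i)},$$

the average empirical log-odds of result $i$ being preferred to any other result (here $\widehat{P}$ denotes
the empirical distribution of the $k$ observed pairs). Under the BTL
model (26), as $k \to \infty$ the structural score converges to

$$s(i) = \frac{1}{m - 1} \sum_{j \neq i} \big[\log(1 + \exp(r_i - r_j)) - \log(1 + \exp(r_j - r_i))\big]. \tag{27}$$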
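To make the aggregation step concrete, the sketch below computes average empirical log-odds scores from a list of observed pairs and compares them with the asymptotic scores (27). It is a sketch under assumptions: the additive `smoothing` constant (which avoids taking the logarithm of a zero count) is not part of the definition in the text, and the function names are hypothetical. Since $\log(1 + e^{d}) - \log(1 + e^{-d}) = d$, the asymptotic score reduces to the average of $r_i - r_j$ over $j \neq i$.

```python
import numpy as np

def empirical_log_odds_scores(pairs, m, smoothing=1.0):
    """Average empirical log-odds of each item beating every other item.

    pairs: iterable of (winner, loser) index pairs among m items.
    smoothing: pseudo-count added to every cell (an assumption of this sketch).
    """
    wins = np.full((m, m), smoothing)        # wins[i, j]: count of "i preferred to j"
    for winner, loser in pairs:
        wins[winner, loser] += 1.0
    log_odds = np.log(wins) - np.log(wins.T)
    np.fill_diagonal(log_odds, 0.0)
    return log_odds.sum(axis=1) / (m - 1)

def asymptotic_scores(r):
    """Asymptotic structure scores under the BTL model, as in (27)."""
    r = np.asarray(r, dtype=float)
    diff = r[:, None] - r[None, :]
    terms = np.log1p(np.exp(diff)) - np.log1p(np.exp(-diff))
    return terms.sum(axis=1) / (len(r) - 1)

# Small simulation: BTL pairs for one query with three results.
rng = np.random.default_rng(0)
r = np.array([0.0, 1.0, 3.0])
pairs = []
for _ in range(30000):
    i, j = rng.choice(3, size=2, replace=False)
    p_ij = 1.0 / (1.0 + np.exp(r[j] - r[i]))
    pairs.append((i, j) if rng.random() < p_ij else (j, i))
est = empirical_log_odds_scores(pairs, 3)
```

With many pairs, `est` should lie close to `asymptotic_scores(r)`, illustrating the convergence $s_k \to s$ as $k$ grows.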



In our setting we may thus evaluate the asymptotic NDCG risk of a scoring function $f$ by computing
the asymptotic scores (27). In addition, Corollary 1 shows that if all minimizers of a loss obey the
ordering of the values

$$\int_{\mathcal{S}} \frac{G(s(j))}{Z(s)} \, d\mu(s), \qquad j \in \{1, \ldots, m\},$$

then the loss is Fisher-consistent. A well-known example Ljl, 37] of such a loss is the least-squares
loss, where the regression labels are $G(s_j)/Z(s)$:

$$\varphi(\alpha, s) = \frac{1}{2m} \sum_{j=1}^m \Big(\alpha_j - \frac{G(s_j)}{Z(s)}\Big)^2. \tag{28}$$

We compare the least-squares aggregation loss with a pairwise logistic loss natural for the pairwise
data generated according to the BTL model (26). Specifically, given a sample pair with $i \succ j$, the
logistic surrogate loss is

$$\varphi(\alpha, i \succ j) = \log\big(1 + \exp(\alpha_j - \alpha_i)\big), \tag{29}$$

which is equivalent or similar to previous losses used for pairwise data in the ranking literature [28].
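As a small illustration of the pairwise logistic surrogate, the sketch below evaluates $\log(1 + \exp(\alpha_j - \alpha_i))$ for an observed preference $i \succ j$ together with its gradient in $\alpha$. The function names are hypothetical, and `np.logaddexp` is used only for numerical stability.

```python
import numpy as np

def pairwise_logistic_loss(alpha, i, j):
    """log(1 + exp(alpha_j - alpha_i)) for an observed preference i > j."""
    return np.logaddexp(0.0, alpha[j] - alpha[i])

def pairwise_logistic_grad(alpha, i, j):
    """Gradient of the pairwise logistic loss with respect to alpha."""
    g = np.zeros_like(alpha, dtype=float)
    # d/dz log(1 + e^z) = 1 / (1 + e^{-z}), evaluated at z = alpha_j - alpha_i
    sig = 1.0 / (1.0 + np.exp(alpha[i] - alpha[j]))
    g[j] = sig      # raising the loser's score increases the loss
    g[i] = -sig     # raising the winner's score decreases the loss
    return g
```

The loss depends on $\alpha$ only through the difference $\alpha_j - \alpha_i$, so each sample pair updates just two coordinates of the score vector.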

For completeness, we also compare with a consistent surrogate that requires access to complete
preference information in the form of the asymptotic structure scores (27). Following Ravikumar
et al. [37], we obtain such a surrogate by granting the regression loss (28) direct access to the
asymptotic structure scores. Note that such a construction would be infeasible in any true pairwise
data setting.

Having described our sampling procedure, aggregation strategy, and loss functions, we now
describe our model. We let $x_i^q$ denote the feature vector for the $i$th result from query $q$, and we
model the scoring function $f(q)_i = \langle \theta, x_i^q \rangle$ for a vector $\theta \in \mathbb{R}^d$. For the regression loss (28), we
minimize the $U$-statistic-based empirical risk (19) over a variety of orders $k$, while for the pairwise
logistic loss (29), we minimize the empirical risk over all pairs sampled according to the BTL
model (26). We regularize our estimates by adding $\Phi(\theta) = (\lambda/2) \|\theta\|_2^2$ to the objective minimized,
and we use the specialized stochastic method (25) to minimize the empirical risk.

Our goals in the experiments are to understand the behavior of the empirical risk minimizer as 
the order k of the aggregating statistic is varied and to evaluate the extent to which aggregation 
improves the estimated scoring function. A secondary concern is to verify that the method is
insensitive to the amount $\lambda$ of regularization performed on $\theta$. We run each experiment 50 times
and report confidence intervals based on those 50 experiments.

Let $\widehat{\theta}^{\rm reg}_{n,k}$ denote the estimate of $\theta$ obtained from minimizing the empirical risk (19) with the
regression loss (28) on $n$ samples with aggregation order $k$, let $\widehat{\theta}^{\log}_n$ denote the estimate of $\theta$ obtained
from minimizing the empirical pairwise logistic loss (29), and let $\widehat{\theta}^{\rm full}$ denote the estimate of $\theta$
obtained from minimizing the empirical risk with surrogate loss (28) using the asymptotic structure
scores (27) directly. Then each plot of Figure 3 displays the risk $R(\widehat{\theta}^{\rm reg}_{n,k})$ as a function of the
aggregation order $k$, using $R(\widehat{\theta}^{\log}_n)$ and $R(\widehat{\theta}^{\rm full})$ as references. The four plots in the figure correspond
to different numbers $n$ of sample pairs.

Broadly, the four plots in Figure 3 match our theoretical results. Consistently across the plots,
we see that for small $k$, it appears there is not sufficient aggregation in the regression-loss-based
empirical risk, and for such small $k$ the pairwise logistic loss is better. However, as the order of
aggregation $k$ grows, the risk performance of $\widehat{\theta}^{\rm reg}_{n,k}$ improves. In addition, with larger sample sizes $n$,
the difference between the risk of $\widehat{\theta}^{\log}_n$ and $\widehat{\theta}^{\rm reg}_{n,k}$ becomes more pronounced. The second salient feature
of the plots is a moderate flattening of the risk $R(\widehat{\theta}^{\rm reg}_{n,k})$ and widening of the confidence interval for
large values of $k$. This seems consistent with the estimation error guarantees in Propositions 5
and 6, where the order $k$ being large has an evidently detrimental effect. Interestingly, however,
large values of $k$ still yield significant improvements over $R(\widehat{\theta}^{\log}_n)$. For very large $k$, the improved
performance of $\widehat{\theta}^{\rm reg}_{n,k}$ over $\widehat{\theta}^{\rm full}$ is a consequence of sampling artifacts and the fact that we use a
finite-dimensional representation. (By using sufficiently many dimensions $d$, the estimator $\widehat{\theta}^{\rm full}$ attains
zero risk by matching the asymptotic scores (27) directly.)

Figure 4 displays the risk $R(\widehat{\theta}^{\rm reg}_{n,k})$ for $n = 800{,}000$ pairs, $k = 100$, and multiple values of
the regularization multiplier $\lambda$ on $\frac{1}{2}\|\theta\|_2^2$. The results, which are consistent across many choices
of $n$, suggest that minimization of the aggregated empirical risk (19) is robust to the choice of
regularization multiplier.
Figure 3. NDCG risk and 95% confidence intervals for $\theta$ estimated using the logistic pairwise
loss and the $U$-statistic empirical risk with $\varphi$ chosen to be the regression loss (28). The horizontal
axis of each plot is the order $k$ of the aggregation in the $U$-statistic (19), the vertical axis is the NDCG
risk, and each plot corresponds to a different number $n$ of samples: (a) $n = 2 \cdot 10^5$, (b) $n = 4 \cdot 10^5$,
(c) $n = 8 \cdot 10^5$, (d) $n = 1.6 \cdot 10^6$.
Figure 4. NDCG risk and 95% confidence intervals for $\theta$ estimated using the $U$-statistic empirical
risk (19) with $\varphi$ chosen as the regression loss (28) under various choices of the regularization
parameter $\lambda$.



7 Conclusions 

In this paper, we demonstrated both the difficulty and the feasibility of designing consistent, 
practicable procedures for ranking. By giving necessary and sufficient conditions for the Fisher- 
consistency of ranking algorithms, we proved that many natural ranking procedures based on 
surrogate losses are inconsistent, even in low-noise settings. To address this inconsistency while 
accommodating the incomplete nature of typical ranking data, we proposed a new family of surrogate
losses, based on $U$-statistics, that aggregate disparate partial preferences. We showed how
our losses can fruitfully leverage any well-behaved rank aggregation procedure and demonstrated
their empirical benefits over more standard surrogates in a series of ranking experiments.

Our work thus takes a step toward bringing the consistency literature for ranking in line with
that for classification, and we anticipate several directions of further development. First, it would
be interesting to formulate low-noise conditions under which faster rates of convergence are possible
for ranking risk minimization (see, e.g., the work of Clémençon et al., which focuses on the
minimization of a single pairwise loss). Additionally, it may be interesting to study structure functions
$s$ that yield non-point distributions $\mu$ as the number of arguments $k$ grows to infinity. For example,
would scaling the Thurstone-Mosteller least-squares solutions (22) by $\sqrt{k}$, to achieve asymptotic
normality, induce greater robustness in the empirical minimizer of the $U$-statistic risk (19)?
Finally, exploring tractable formulations of other supervised learning problems in which label data is
naturally incomplete could be fruitful.
A Proofs of Consistency Results

A.1 Proof of Proposition 1

We begin by recalling the definition of the convex conjugate $h^*$ of a function $h$ and its biconjugate
$h^{**}$:

$$h^*(y) := \sup_x \{yx - h(x)\} \quad \text{and} \quad h^{**}(x) := \sup_y \{xy - h^*(y)\},$$

noting that $h^{**}(x) \le h(x)$ and that $h^{**}$ is closed convex [25]. The following lemma is useful for the
proof of Proposition 1.

Lemma 2. Let $h : [0, 1] \to \mathbb{R}_+ \cup \{\infty\}$ be non-decreasing with $h(0) = 0$ and $h(x) > 0$ for $x > 0$.
Then $h^{**}(x) > 0$ for $x > 0$.

Proof. For the sake of contradiction assume that there is some $z \in (0, 1)$ with $h^{**}(z) = 0$. Since
$h^{**}$ is convex with $h^{**}(0) = 0$, we have $h^{**}([0, z]) = \{0\}$, and in particular $h^{**}(z/2) = 0$. By
assumption $h(z/2) = b > 0$, whence $h(1) \ge b > 0$. Thus we can define the closed convex function
$g$ as

$$g(x) = \begin{cases} 0 & \text{if } x \le z/2 \\ \frac{b}{1 - z/2}\,(x - z/2) & \text{otherwise,} \end{cases}$$

and have $g(x) \le h(x)$ since $h(1) \ge h(z/2) = b > 0$. It is clear, however, that $g(z) > 0 = h^{**}(z)$,
which contradicts the fact [25] that $h^{**}$ is the greatest closed convex minorant of $h$. □



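Proof of Proposition 1. The proof is analogous to proofs of similar results due to Zhang [49],
Tewari and Bartlett [45], and Steinwart. First, note that the function $H$ is non-decreasing in
its argument and satisfies $H(0) = 0$ and $H(1 + \epsilon) = \infty$ for all $\epsilon > 0$. Jensen's inequality implies
that

$$\begin{aligned}
H^{**}(R(f) - R^*) &= H^{**}\Big(\sum_q p_q \big[\ell(f(q), \mu_q) - \inf_\alpha \ell(\alpha, \mu_q)\big]\Big) \\
&\le \sum_q p_q H^{**}\big(\ell(f(q), \mu_q) - \inf_\alpha \ell(\alpha, \mu_q)\big) \\
&\le \sum_q p_q \big[\ell_\varphi(f(q), \mu_q) - \inf_\alpha \ell_\varphi(\alpha, \mu_q)\big] = R_\varphi(f) - R_\varphi^*
\end{aligned} \tag{30}$$

by the definition (8) of $H$. Lemma 2 implies that $H^{**}(\epsilon) > 0$ for $\epsilon > 0$, and since $H^{**}$ is closed
convex and non-negative, it is continuous from the right at 0 so long as $H^{**}(\epsilon) < \infty$ for some $\epsilon > 0$.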

Now, assume that $\varphi$ is pointwise consistent. Let $f_n$ be a sequence of measurable functions such
that $R_\varphi(f_n) \to R_\varphi^*$. For shorthand, define $h(\epsilon) = H^{**}(\epsilon)$. If $H^{**}(\epsilon) = \infty$ for all $\epsilon > 0$, then
$H(\epsilon) \ge H^{**}(\epsilon) = \infty$ for all $\epsilon > 0$. Then the definition (8) and the finiteness $R_\varphi(f) < \infty$ imply that
$\ell(f(q), \mu) - \inf_{\alpha'} \ell(\alpha', \mu) = 0$ for all $\mu \in \mathcal{M}(\mathcal{S})$, and hence $\delta(\epsilon) = \infty$ in Definition 2.
Thus we can assume that $H(\epsilon) < \infty$ for some $\epsilon > 0$, and let $h'_+(\epsilon)$ denote the right-derivative
of $h$ at the point $\epsilon$ for $\epsilon \in \operatorname{int} \operatorname{dom} h$. This interior is non-empty by assumption, and the
right-derivative exists and is positive since $h$ is closed convex and $H$ is non-decreasing with $H(x) > H(0)$
for $x > 0$ [25]. Now fix some $\epsilon > 0$, and let $b = h'_+(\epsilon/2) > 0$. Since $h$ is convex and $b$ is a
subgradient of $h$, we have $h(\epsilon) \ge h(\epsilon/2) + b\epsilon/2 > b\epsilon/2$, so that if $h(\delta) < b\epsilon/2$, then $\delta < \epsilon$. Applying
the bound (30), we can take $N \in \mathbb{N}$ such that $n \ge N$ implies $R_\varphi(f_n) - R_\varphi^* < b\epsilon/2$, and as such
$H^{**}(R(f_n) - R^*) < b\epsilon/2$ so that $R(f_n) - R^* < \epsilon$.

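Now we turn to the converse, where we assume that $\varphi$ is uniformly consistent. Let $\{p_q\}$ be
supported completely on a single $q$. Then

$$R(f) - R^* = \ell(f(q), \mu_q) - \inf_{\alpha'} \ell(\alpha', \mu_q)$$

and

$$R_\varphi(f) - R_\varphi^* = \ell_\varphi(f(q), \mu_q) - \inf_{\alpha'} \ell_\varphi(\alpha', \mu_q).$$

Fix an $\epsilon > 0$, and let $\alpha = f(q) \in \mathbb{R}^m$. By the contrapositive of Definition 2 we know that there
exists $\delta(\epsilon) > 0$ such that

$$R(f) - R^* = \ell(\alpha, \mu_q) - \inf_{\alpha'} \ell(\alpha', \mu_q) \ge \epsilon \quad \text{implies} \quad R_\varphi(f) - R_\varphi^* = \ell_\varphi(\alpha, \mu_q) - \inf_{\alpha'} \ell_\varphi(\alpha', \mu_q) \ge \delta(\epsilon),$$

which holds for any measurable $f$ (and hence any $\alpha \in \mathbb{R}^m$). Since the measure $\mu_q$ was arbitrary,
we see that the function $H(\epsilon)$ defined by the infimum (8) satisfies $H(\epsilon) \ge \delta(\epsilon) > 0$ for all $\epsilon > 0$. □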



A.2 Proof of Theorem 1

We prove Theorem 1 in two parts: first under Assumption C and second under Assumption C'
under the additional condition of 0-coercivity of $\ell_\varphi(\cdot, \mu)$ (recall a function $g : \mathbb{R}^m \to \mathbb{R}$ is called
$k$-coercive [25] if $g(x)/\|x\|^k \to +\infty$ as $\|x\| \to \infty$).

The proof of Theorem 1 using Assumption C requires three lemmas that make it quite
straightforward. We begin by recalling the definition (3) of the optimal conditional surrogate loss
$\ell_\varphi^*(\mu) := \inf_\alpha \ell_\varphi(\alpha, \mu)$. We have the following three results, each assuming the conditions of
Theorem 1 with Assumption C.



Lemma 3 (Zhang [49], Lemma 27; Tewari and Bartlett [45], Lemma 16). The function $\ell_\varphi^*$ is
continuous in the measure $\mu$.

Lemma 4. Let $\varphi$ be structure-consistent and $\{\alpha_n\} \subset \mathbb{R}^m$ be a sequence of vectors. Then
$\ell_\varphi(\alpha_n, \mu) \to \ell_\varphi^*(\mu)$ implies that $\ell(\alpha_n, \mu) \to \inf_\alpha \ell(\alpha, \mu)$ and, for large enough $n$, the vector
$\alpha_n \in \operatorname{argmin}_\alpha \ell(\alpha, \mu)$.

Proof. Suppose for contraposition that the sequence of vectors $\{\alpha_n\}$ has $\ell(\alpha_n, \mu) \not\to \inf_\alpha \ell(\alpha, \mu)$.
Then there is a subsequence $\alpha_{n_j}$ of $\alpha_n$ and some $\epsilon > 0$ such that $\ell(\alpha_{n_j}, \mu) \ge \inf_\alpha \ell(\alpha, \mu) + \epsilon$ for all
$j$. Thus we have $\alpha_{n_j} \notin \operatorname{argmin}_\alpha \ell(\alpha, \mu)$ for any $j$, and Definition 3 implies that

$$\ell_\varphi(\alpha_{n_j}, \mu) \ge \inf_\alpha \Big\{\ell_\varphi(\alpha, \mu) \;\Big|\; \alpha \notin \operatorname{argmin}_{\alpha'} \ell(\alpha', \mu)\Big\} > \ell_\varphi^*(\mu).$$

As a consequence, we have $\ell_\varphi(\alpha_{n_j}, \mu)$ uniformly bounded away from $\ell_\varphi^*(\mu)$, which is the contrapositive
of the lemma's first claim. Thus, whenever $\ell_\varphi(\alpha_n, \mu) \to \ell_\varphi^*(\mu)$, we have $\ell(\alpha_n, \mu) \to \inf_\alpha \ell(\alpha, \mu)$,
and since the range of $L$ (and hence $\ell$ by Assumption C) is finite by assumption, we must have
$\alpha_n \in \operatorname{argmin}_\alpha \ell(\alpha, \mu)$ for sufficiently large $n$. □

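Lemma 5. Let $\varphi$ be structure-consistent. Then the suboptimality function (8) satisfies $H(\epsilon) > 0$
for all $\epsilon > 0$.

Proof. An equivalent statement to the result that $H(\epsilon) > 0$ for all $\epsilon > 0$ is that for any $\epsilon > 0$,
there exists a $\delta(\epsilon) > 0$ such that for any $\mu \in \mathcal{M}(\mathcal{S})$,

$$\ell(\alpha, \mu) - \inf_{\alpha'} \ell(\alpha', \mu) \ge \epsilon \quad \text{implies} \quad \ell_\varphi(\alpha, \mu) - \ell_\varphi^*(\mu) \ge \delta(\epsilon).$$

We give a proof by contradiction using Lemmas 3 and 4. Assume that there is a sequence of pairs
$(\alpha_n, \mu_n) \in \mathbb{R}^m \times \mathcal{M}(\mathcal{S})$ with

$$\ell(\alpha_n, \mu_n) - \inf_{\alpha'} \ell(\alpha', \mu_n) \ge \epsilon \quad \text{and} \quad \ell_\varphi(\alpha_n, \mu_n) - \ell_\varphi^*(\mu_n) \to 0.$$

By the compactness of $\mathcal{M}(\mathcal{S})$ (recall $\mathcal{S}$ is finite), we can take a subsequence $n_j$ such that $\mu_{n_j} \to \mu$
for some $\mu \in \mathcal{M}(\mathcal{S})$. Then Lemma 3 gives $\ell_\varphi^*(\mu_{n_j}) \to \ell_\varphi^*(\mu)$; this coupled with the fact that
$\ell_\varphi(\alpha_n, \mu_n) - \ell_\varphi^*(\mu_n) \to 0$ implies

$$\lim_j \ell_\varphi(\alpha_{n_j}, \mu_{n_j}) = \lim_j \ell_\varphi^*(\mu_{n_j}) = \ell_\varphi^*(\mu). \tag{31}$$

If we can show that $\lim_j \ell_\varphi(\alpha_{n_j}, \mu) = \ell_\varphi^*(\mu)$, then we are done. Indeed, if this is the case then
Lemma 4 implies that $\alpha_{n_j} \in \operatorname{argmin}_\alpha \ell(\alpha, \mu)$ eventually, whence $\ell(\alpha_{n_j}, \mu) - \inf_{\alpha'} \ell(\alpha', \mu) = 0$. Then
the continuity of $\ell(\alpha, \mu)$ in $\mu$ contradicts $\ell(\alpha_n, \mu_n) - \inf_{\alpha'} \ell(\alpha', \mu_n) \ge \epsilon$ for all $n$.

To the end of showing $\lim_j \ell_\varphi(\alpha_{n_j}, \mu) = \ell_\varphi^*(\mu)$, fix an arbitrary $\epsilon' > 0$ and, recalling that $\mathcal{S}$ is
finite, let $\mathcal{S}_+$ denote the set of $s \in \mathcal{S}$ such that $\mu(s) > 0$. Since $\mu_{n_j} \to \mu$, there must be some $B \in \mathbb{R}$
and $J \in \mathbb{N}$ such that $\varphi(\alpha_{n_j}, s) \le B$ for all $j \ge J$ and $s \in \mathcal{S}_+$. Now, choose $J' \ge J$ so that $j \ge J'$
implies $\mu(s) \le \mu_{n_j}(s) + \epsilon'/(B|\mathcal{S}|)$. Then

$$\int_{\mathcal{S}_+} \varphi(\alpha_{n_j}, s)\, d\mu(s) \le \int_{\mathcal{S}_+} \varphi(\alpha_{n_j}, s)\, d\mu_{n_j}(s) + \epsilon'$$

for $j \ge J'$, and thus

$$\begin{aligned}
\limsup_j \ell_\varphi(\alpha_{n_j}, \mu) &= \limsup_j \int_{\mathcal{S}_+} \varphi(\alpha_{n_j}, s)\, d\mu(s) \\
&\le \limsup_j \int_{\mathcal{S}_+} \varphi(\alpha_{n_j}, s)\, d\mu_{n_j}(s) + \epsilon' \\
&\le \limsup_j \int_{\mathcal{S}} \varphi(\alpha_{n_j}, s)\, d\mu_{n_j}(s) + \epsilon' = \ell_\varphi^*(\mu) + \epsilon'
\end{aligned}$$

by using the limit (31). Since $\epsilon'$ was arbitrary, we see that $\lim_j \ell_\varphi(\alpha_{n_j}, \mu) = \ell_\varphi^*(\mu)$ as desired. □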



Proof of Theorem 1 under Assumption C. Part (a) of the theorem follows immediately from
Lemma 5, which implies that for any $\epsilon > 0$, $H(\epsilon) > 0$, whence we can apply Proposition 1.

For part (b), we focus on the case when $\{p_q\}$ is supported completely on a single $q$ and prove the
result by contradiction. Assume that $\varphi$ is not structure consistent, so that there exists a $\mu \in \mathcal{M}(\mathcal{S})$
satisfying

$$\ell_\varphi^*(\mu) = \inf_\alpha \ell_\varphi(\alpha, \mu) = \inf_\alpha \Big\{\ell_\varphi(\alpha, \mu) \;\Big|\; \alpha \notin \operatorname{argmin}_{\alpha'} \ell(\alpha', \mu)\Big\}.$$

There exists a sequence $\alpha_n \notin \operatorname{argmin}_\alpha \ell(\alpha, \mu)$ with $\ell_\varphi(\alpha_n, \mu) \to \ell_\varphi^*(\mu)$. Define the sequence of
functions $f_n$ with $f_n(q) = \alpha_n$, and let $\mu_q = \mu$. In this case, we have $R_\varphi^* = \ell_\varphi^*(\mu)$, but by
the finiteness of the range of $\ell(\cdot, \mu)$, there is some $\epsilon > 0$ such that $\ell(\alpha_n, \mu) \ge \inf_\alpha \ell(\alpha, \mu) + \epsilon$ for all
$n$. Thus we see that although $R_\varphi(f_n) \to R_\varphi^*$, we have

$$R(f_n) = \ell(f_n(q), \mu) = \ell(\alpha_n, \mu) \not\to R^* = \inf_\alpha \ell(\alpha, \mu).$$

Thus we fail to have consistency as desired. □
Now we turn to proving the result of Theorem 1 when Assumption C' holds, but we also assume
that for any $\mu \in \mathcal{M}(\mathcal{S})$, the function $\alpha \mapsto \ell_\varphi(\alpha, \mu)$ is 0-coercive. We begin by noting that since
$\mathcal{S}$ is compact, Prohorov's theorem implies the collection of probability measures defined on $\mathcal{S}$ is
compact with respect to the topology of weak convergence [3, Chapter 5]. We let $\rightsquigarrow$ denote weak
convergence (or convergence in distribution) of measures. To prove the result, all we require is an
analogue of Lemma 5.

Before proceeding, we recall a few results from variational analysis, referring to the book of
Rockafellar and Wets [38] for a deeper exploration of these issues. For notational convenience, for a
function $g : \mathbb{R}^m \to \mathbb{R}$ we define the approximate minimal set

$$\epsilon\text{-}\operatorname{argmin} g := \Big\{x \in \mathbb{R}^m : g(x) \le \inf_x g(x) + \epsilon\Big\}.$$

Proposition 4. Let $g_n : \mathbb{R}^m \to \mathbb{R}$ be a sequence of closed convex functions converging pointwise to
a 0-coercive function $g$. Then $g$ is closed convex. Moreover, if $\epsilon_n > 0$ is a sequence with $\epsilon_n \to 0$,
then any sequence $x_n \in \epsilon_n\text{-}\operatorname{argmin} g_n$ has a cluster point and

$$\limsup_n g(x_n) \le \inf_x g(x).$$

Proof. This is a recapitulation of results due to Rockafellar and Wets [38] and Hiriart-Urruty
and Lemaréchal [25]. We begin by noting that since the $g_n$ converge pointwise, the function $g$ is
convex [25, Theorem IV.3.1.5]. Pointwise convergence of convex $g_n$ implies convergence of their
epigraphs in the sense of set convergence [38, Theorem 7.17], and hence $g$ is closed convex [38,
Proposition 7.4(a)]. Since $g$ is coercive (by assumption), Theorem 7.33 of [38] states that there
exists a bounded set $C \subset \mathbb{R}^m$ such that $\epsilon_n\text{-}\operatorname{argmin} g_n \subset C$ for all suitably large $n$. The result also
states that for any $\epsilon > 0$, eventually $\epsilon_n\text{-}\operatorname{argmin} g_n \subset \epsilon\text{-}\operatorname{argmin} g$, and the sequence of non-empty
sets $\epsilon_n\text{-}\operatorname{argmin} g_n$ has all its cluster points in $\operatorname{argmin} g$. □



We now give our promised analogue of Lemma 5.

Lemma 6. Let $\varphi$ be structure consistent and assume that for each $\mu \in \mathcal{M}(\mathcal{S})$, the function
$\alpha \mapsto \ell_\varphi(\alpha, \mu)$ is 0-coercive. Then the suboptimality function (8) satisfies $H(\epsilon) > 0$ for all $\epsilon > 0$.

Proof. The proof follows that of Lemma 5, but we require the somewhat stronger tools of
Proposition 4 since $\mathcal{S}$ is no longer necessarily finite.

Assume for the sake of contradiction that there exists $\epsilon > 0$ and a sequence of pairs
$(\alpha_n, \mu_n) \in \mathbb{R}^m \times \mathcal{M}(\mathcal{S})$ satisfying

$$\ell(\alpha_n, \mu_n) - \inf_{\alpha'} \ell(\alpha', \mu_n) \ge \epsilon \quad \text{and} \quad \ell_\varphi(\alpha_n, \mu_n) - \ell_\varphi^*(\mu_n) \to 0.$$

Since $\mathcal{M}(\mathcal{S})$ is compact with respect to weak convergence (Prohorov's theorem), we can take a
subsequence $n_j$ such that $\mu_{n_j} \rightsquigarrow \mu$ for some $\mu \in \mathcal{M}(\mathcal{S})$.

By weak convergence, we have the pointwise convergence

$$\lim_j \ell_\varphi(\alpha, \mu_{n_j}) = \ell_\varphi(\alpha, \mu) \quad \text{for all } \alpha \in \mathbb{R}^m.$$

Of course, the functions $\ell_\varphi(\cdot, \mu_n)$ and $\ell_\varphi(\cdot, \mu)$ are convex, and the latter is 0-coercive, whence
Proposition 4 applies. We find in particular

$$\limsup_j \ell_\varphi(\alpha_{n_j}, \mu) \le \inf_\alpha \ell_\varphi(\alpha, \mu) = \ell_\varphi^*(\mu).$$

Lemma 4 holds under Assumption C', and the proof is, mutatis mutandis, the same. Thus, applying
Lemma 4, we find that $\alpha_{n_j} \in \operatorname{argmin}_\alpha \ell(\alpha, \mu)$ eventually, whence $\ell(\alpha_{n_j}, \mu) - \inf_{\alpha'} \ell(\alpha', \mu) = 0$.
Since $\mu \mapsto \ell(\alpha, \mu)$ is continuous, this contradicts the assumption that $\ell(\alpha_n, \mu_n) - \inf_{\alpha'} \ell(\alpha', \mu_n) \ge \epsilon$
for all $n$, which is our desired contradiction. So $H(\epsilon) > 0$. □



Proof of Theorem 1 under Assumption C' and coercivity. The proof is completely identical
to that under Assumption C, but now instead of Lemma 5 we apply Lemma 6. □

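A.3 Proof of Lemma 1

Consider the function

$$g(p) := \sum_{i=1}^m \frac{1}{F(i)}\, p_i \prod_{j=1}^{i-1} (1 - p_j)$$

and assume that $p_i < p_{i+1}$ for some index $i$. Let $p' = p$ except that the values of $p_i$ and $p_{i+1}$ are
swapped. The terms of $g$ with index larger than $i + 1$ are unchanged by the swap, so

$$\begin{aligned}
g(p) - g(p') &= \frac{1}{F(i)}\, p_i \prod_{j=1}^{i-1} (1 - p_j) + \frac{1}{F(i+1)}\, p_{i+1} \prod_{j=1}^{i} (1 - p_j) \\
&\quad - \frac{1}{F(i)}\, p_{i+1} \prod_{j=1}^{i-1} (1 - p_j) - \frac{1}{F(i+1)}\, p_i (1 - p_{i+1}) \prod_{j=1}^{i-1} (1 - p_j) \\
&= \frac{1}{F(i)} (p_i - p_{i+1}) \prod_{j=1}^{i-1} (1 - p_j) + \frac{1}{F(i+1)} (p_{i+1} - p_i) \prod_{j=1}^{i-1} (1 - p_j) \\
&= \Big(\frac{1}{F(i)} - \frac{1}{F(i+1)}\Big) (p_i - p_{i+1}) \prod_{j=1}^{i-1} (1 - p_j) \le 0
\end{aligned}$$

since $F$ is an increasing function. Thus swapping $p_i$ and $p_{i+1}$ increases the objective $g(p)$, proving
the lemma.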
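The swapping argument can be checked numerically on a small instance. The sketch below assumes the objective has the form $g(p) = \sum_i F(i)^{-1} p_i \prod_{j<i} (1 - p_j)$ read off from the proof above, and it picks $F(i) = \log_2(1 + i)$ purely for illustration; both the helper name `g` and this choice of $F$ are assumptions. A brute-force search over all orderings should then return the values sorted in decreasing order.

```python
from itertools import permutations
import math

def g(p, F):
    """Objective sum_i p_i * prod_{j<i} (1 - p_j) / F(i), with 1-based positions."""
    total, survive = 0.0, 1.0
    for i, p_i in enumerate(p, start=1):
        total += p_i * survive / F(i)
        survive *= 1.0 - p_i      # running product of (1 - p_j) for j <= i
    return total

F = lambda i: math.log2(1 + i)    # an increasing function (illustrative choice)
p = [0.2, 0.7, 0.5, 0.1]

# Exhaustive search over all orderings of the entries of p.
best = max(permutations(p), key=lambda perm: g(perm, F))
```

Because every adjacent swap that moves a larger value earlier increases $g$, the maximizer found by the search coincides with `sorted(p, reverse=True)`.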
B Proofs of Uniform Laws

In this section, we provide the proof of Theorem 2. The proof has two parts, the first controlling the
expectation (20), the second controlling the convergence of the empirical risk $\widehat{R}_{\varphi,n}$ to its expectation
$R_{\varphi,n}$.



B.1 Expectation

We begin by studying the expectation (20). Our goal is to understand the rate at which we have
convergence

$$R_{\varphi,n}(f) \to \sum_q p_q\, \mathbb{E}\big[\varphi(f(q), s(Y_1, \ldots, Y_k)) \mid Q = q\big]. \tag{32}$$

If the convergence (32) occurs sufficiently quickly, then we can allow $k$ to increase to infinity,
capturing the asymptotic surrogate risk $R_\varphi$. The following proposition, whose proof we provide in
Appendix B.3, gives sufficient conditions for such a convergence result to hold.

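Proposition 5. Let Assumptions D and E hold for the sequence of function classes $\mathcal{F}_n$. There
exists a constant $C(K_1, \beta) < \infty$ such that

$$\sup_{f \in \mathcal{F}_n} \Big| R_{\varphi,n}(f) - \sum_q p_q\, \mathbb{E}\big[\varphi(f(q), s(Y_1, \ldots, Y_k)) \mid Q = q\big] \Big| \le C(K_1, \beta)\, B_n\, n^{-\frac{\beta}{1+\beta}} (k + \log k + 2 \log n)^{\frac{\beta}{1+\beta}}.$$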



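Proposition 5 is suggestive of Condition II and its requirements on $B_n$ and $k_n$. We now show
how Conditions I or I' and Condition II imply that $\sup_{f \in \mathcal{F}_n} |R_{\varphi,n}(f) - R_\varphi(f)| \to 0$. First, we show
how Condition I implies I'. Indeed, since $\sum_q p_q = 1$, Condition I implies that for any $f \in \mathcal{F}_n$

$$\Big| R_\varphi(f) - \sum_q p_q\, \mathbb{E}_q\big[\varphi(f(q), s(Y_{1:k_n}))\big] \Big| \le \sum_q p_q \Big| \mathbb{E}_q\big[\varphi(f(q), s(Y_{1:k_n}))\big] - \lim_{k'} \mathbb{E}_q\big[\varphi(f(q), s(Y_{1:k'}))\big] \Big| \le C B_n k_n^{-\rho},$$

which tends to 0 whenever $B_n = o(k_n^\rho)$. Then for any $\epsilon_n > 0$ we obtain

$$\max_{i \in [N(\epsilon_n, n)]} \inf_{f \in \mathcal{F}_n^i} \Big| R_\varphi(f) - \sum_q p_q\, \mathbb{E}_q\big[\varphi(f(q), s(Y_{1:k_n}))\big] \Big| \le C B_n k_n^{-\rho} \to 0,$$

and we may assume $\epsilon_n L_n \to 0$ in Condition I'. This established, the following lemma captures the
convergence of expectations (see Appendix B.4).

Lemma 7. Let Condition I' and $k_n B_n^{\frac{1+\beta}{\beta}} = o(n)$ hold. Under Assumptions D and E, as $n \to \infty$,

$$\sup_{f \in \mathcal{F}_n} \big| R_{\varphi,n}(f) - R_\varphi(f) \big| \to 0.$$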



B.2 High probability guarantees

We now turn to demonstrating high-probability convergence of the empirical risk (19) to its
expectation. Our approach makes use of the smoothness Assumption E along with a covering number
argument [e.g., 48]. We use the bounded differences inequality [32] to demonstrate the convergence
of the empirical risk (19) to its expectation. We begin by viewing $\widehat{R}_{\varphi,n}(f)$ as a random function of
the $n$ query preference evaluation pairs $(Q_1, Y_1), \ldots, (Q_n, Y_n)$; define

$$F((Q_1, Y_1), \ldots, (Q_n, Y_n)) := \widehat{R}_{\varphi,n}(f).$$

We have the following lemma, whose proof we provide in Appendix B.5.

Lemma 8. Assume $|\varphi(f(q), s(Y_1, \ldots, Y_m))| \le B$ for all $f \in \mathcal{F}$. Then

$$\big| F((Q_1, Y_1), \ldots, (Q_j, Y_j), \ldots, (Q_n, Y_n)) - F((Q_1, Y_1), \ldots, (Q_j', Y_j'), \ldots, (Q_n, Y_n)) \big| \le \frac{4kB}{n}$$

for all $j$ and $(Q_j', Y_j')$.

As a consequence of Lemma 8, the bounded differences inequality [32] implies that if
$|\varphi(f(q), s(Y_1, \ldots, Y_k))| \le B_n$ for any fixed $f \in \mathcal{F}_n$, then

$$P\Big( \big| \widehat{R}_{\varphi,n}(f) - R_{\varphi,n}(f) \big| \ge \epsilon \Big) \le 2 \exp\Big( -\frac{n \epsilon^2}{8 k^2 B_n^2} \Big). \tag{33}$$
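To get a feel for the concentration bound, the sketch below evaluates the right-hand side $2\exp(-n\epsilon^2/(8k^2B^2))$ that results from the bounded differences inequality with per-coordinate sensitivity $4kB/n$, and inverts it to obtain a sample size that guarantees a target failure probability. The function names are hypothetical, and the numbers in the usage example are illustrative only.

```python
import math

def deviation_bound(n, k, B, eps):
    """Upper bound on P(|empirical risk - its expectation| >= eps) for a fixed f."""
    return min(1.0, 2.0 * math.exp(-n * eps**2 / (8.0 * k**2 * B**2)))

def samples_needed(k, B, eps, delta):
    """Smallest n for which the bound above is at most delta."""
    return math.ceil(8.0 * k**2 * B**2 * math.log(2.0 / delta) / eps**2)

# Example: aggregation order k = 10, losses bounded by B = 1.
n_required = samples_needed(k=10, B=1.0, eps=0.1, delta=0.05)
```

The required sample size grows quadratically in both $k$ and $B$, which is consistent with the detrimental effect of large $k$ on estimation error noted in the experiments.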



Using Assumption E, we can now apply our covering number argument to get a uniform bound on
the deviations $\widehat{R}_{\varphi,n}(f) - R_{\varphi,n}(f)$. We obtain

Proposition 6. Let Assumptions E and F and Condition II hold. Then

$$\sup_{f \in \mathcal{F}_n} \big| \widehat{R}_{\varphi,n}(f) - R_{\varphi,n}(f) \big| \xrightarrow{p} 0.$$

See Appendix B.6 for the proof of the proposition, noting that the bound (33) guarantees almost
sure and expected convergence in the conclusion of the proposition under the conditions in the
remark following Theorem 2.

To complete the proof of Theorem 2, note that Lemma 7 guarantees the uniform convergence
of the difference of expectations $R_{\varphi,n}(f) - R_\varphi(f)$; applying the triangle inequality in Proposition 6
completes the argument.

B.3 Proof of Proposition 5

We begin by fixing some $q_0$ and splitting the summation (20) into two terms:

$$\begin{aligned}
R_{\varphi,n}(f) &= \frac{1}{n} \sum_{q \le q_0} \Big[ \sum_{m=1}^n m\, P_n(n_q = m)\, \mathbb{E}\big[\varphi(f(Q), s(Y_1, \ldots, Y_{m \wedge k})) \mid Q = q\big] \Big] \\
&\quad + \frac{1}{n} \sum_{q > q_0} \Big[ \sum_{m=1}^n m\, P_n(n_q = m)\, \mathbb{E}\big[\varphi(f(Q), s(Y_1, \ldots, Y_{m \wedge k})) \mid Q = q\big] \Big],
\end{aligned}$$

the sum over queries $q \le q_0$ and the sum over $q > q_0$. We control each sum in turn, using a Chernoff
bound to control the first and the fact that $\sum_{q > q_0} p_q \to 0$ quickly enough as $q_0 \uparrow \infty$ (due to the
power law Assumption D) for the second. We also use the shorthand notation $\mathbb{E}_q[\varphi(f(q), s(Y_{1:k}))]$
for the quantity $\mathbb{E}[\varphi(f(Q), s(Y_1, \ldots, Y_k)) \mid Q = q]$ to keep the arguments neater.

We begin by studying the expectation of the more probable queries, for which we require a 
version of the Chernoff bound: 
Lemma 9 (Multiplicative Chernoff bound). Let $X_i$ be i.i.d. Bernoulli random variables with
$P(X_i = 1) = p$. For any $\delta \ge 0$,

$$P\Big( \sum_{i=1}^n X_i < (1 - \delta) np \Big) \le \exp\Big( -\frac{np\delta^2}{2} \Big).$$

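As an immediate consequence of the Chernoff bound, we see that for any query $q$ and empirical
count $n_q$ (with $k \le np_q$), we have

$$P_n(n_q < k) = P_n\Big( n_q < \Big(1 - \Big(1 - \frac{k}{np_q}\Big)\Big) np_q \Big) \le \exp\Big( -\frac{np_q}{2} \Big(1 - \frac{k}{np_q}\Big)^2 \Big) \le \exp\Big( -\frac{np_q}{2} + k \Big).$$

For any $\epsilon \in (0, 1)$, to have $\mathbb{E}[n_q 1(n_q < k)] \le \epsilon n p_q$, it is sufficient that

$$n \ge \frac{2}{p_q} \Big[ k + \log k + \log \frac{1}{\epsilon} \Big]. \tag{34}$$

Indeed, for such $n$ we have $np_q \ge 1$ and

$$\begin{aligned}
\mathbb{E}[n_q 1(n_q < k)] &= \sum_{m=1}^{k-1} m\, P_n(n_q = m) \le k\, P_n(n_q < k) \le k \exp\Big( -\frac{np_q}{2} + k \Big) \\
&\le k \exp(-\log k + \log \epsilon) = \epsilon \le \epsilon n p_q.
\end{aligned}$$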
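The sufficiency of the sample-size condition can be sanity-checked numerically. The sketch below assumes the condition reads $n \ge (2/p_q)(k + \log k + \log(1/\epsilon))$, as suggested by the chain of inequalities above, and compares the exact value of $\mathbb{E}[n_q 1(n_q < k)]$ for $n_q \sim \mathrm{Binomial}(n, p_q)$ with the target $\epsilon$. The helper names and the particular values of $p_q$, $k$, and $\epsilon$ are illustrative assumptions.

```python
import math

def n_sufficient(p_q, k, eps):
    """Smallest integer n with n >= (2 / p_q) * (k + log k + log(1 / eps))."""
    return math.ceil(2.0 / p_q * (k + math.log(k) + math.log(1.0 / eps)))

def truncated_mean(n, p_q, k):
    """Exact E[n_q * 1(n_q < k)] for n_q ~ Binomial(n, p_q)."""
    return sum(
        m * math.comb(n, m) * p_q**m * (1.0 - p_q) ** (n - m)
        for m in range(1, k)
    )

p_q, k, eps = 0.1, 5, 0.1
n = n_sufficient(p_q, k, eps)
exact = truncated_mean(n, p_q, k)                 # quantity being controlled
chernoff = k * math.exp(-n * p_q / 2.0 + k)       # bound used in the proof
```

In this example the exact truncated mean is far smaller than the Chernoff-based bound, which is in turn at most $\epsilon$; the condition is sufficient but not tight.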
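For fixed $\epsilon > 0$ and $n \in \mathbb{N}$, we let $q_0 = q_0(\epsilon, n)$ for

$$q_0(\epsilon, n) := \max\Big\{ q : \frac{2}{p_q} \Big[ k + \log k + \log \frac{1}{\epsilon} \Big] \le n \Big\}. \tag{35}$$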



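Then for all $q \le q_0$, since $\sum_{m=1}^n m\, P_n(n_q = m) = np_q$ and $k\, P_n(n_q < k) \le \epsilon$, we immediately see
that

$$\begin{aligned}
(1 - \epsilon)\, n p_q\, \mathbb{E}_q\big[\varphi(f(q), s(Y_1, \ldots, Y_k))\big] &\le \sum_{m=k}^n m\, P_n(n_q = m)\, \mathbb{E}_q\big[\varphi(f(q), s(Y_1, \ldots, Y_k))\big] \\
&\le \sum_{m=1}^n m\, P_n(n_q = m)\, \mathbb{E}_q\big[\varphi(f(q), s(Y_1, \ldots, Y_{m \wedge k}))\big]
\end{aligned}$$

and additionally that

$$\sum_{m=1}^{k-1} m\, P_n(n_q = m)\, \mathbb{E}_q\big[\varphi(f(q), s(Y_{1:m}))\big] + \sum_{m=k}^{n} m\, P_n(n_q = m)\, \mathbb{E}_q\big[\varphi(f(q), s(Y_{1:k}))\big] \le \epsilon B_n + n p_q\, \mathbb{E}_q\big[\varphi(f(q), s(Y_1, \ldots, Y_k))\big]$$

by applying Assumption E on the boundedness of $\varphi$. We thus see that

$$(1 - \epsilon) \sum_{q \le q_0} p_q\, \mathbb{E}_q\big[\varphi(f(q), s(Y_{1:k}))\big] \le R_{\varphi,n}(f) \le \sum_{q \le q_0} \Big( p_q\, \mathbb{E}_q\big[\varphi(f(q), s(Y_{1:k}))\big] + \frac{\epsilon B_n}{n} \Big) + B_n \sum_{q > q_0} p_q \tag{36}$$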
for all $f \in \mathcal{F}_n$. The sandwich inequality (36) suggests that we should have our desired convergence
statement (32) so long as the bound $B_n$ does not grow too quickly.

Now we use Assumption D to complete the proof by bounding $q_0$ and $\sum_{q > q_0} p_q$. Recalling the
definition (35) of $q_0$, we note that for $q > q_0$,

$$p_q < \frac{2(k + \log k + \log \epsilon^{-1})}{n}.$$

Let

$$q^* := \Big( \frac{n K_1}{2(k + \log k + \log \epsilon^{-1})} \Big)^{\frac{1}{1+\beta}}$$

denote the $q$ solving $K_1 q^{-\beta - 1} = 2(k + \log k + \log \epsilon^{-1})/n$. Then

$$\sum_{q > q_0} p_q \le \int_0^{q^*} \frac{2(k + \log k + \log \epsilon^{-1})}{n}\, dq + \int_{q^*}^\infty K_1 q^{-\beta - 1}\, dq \le C \Big( \frac{k + \log k + \log \epsilon^{-1}}{n} \Big)^{\frac{\beta}{1+\beta}},$$

where $C$ is a constant dependent on $K_1$ and $\beta$. Lastly, we bound $q_0(\epsilon, n)$. Since $p_q \le K_1 q^{-\beta - 1}$, the
inequality (34) can be satisfied only if

$$q \le \Big( \frac{n K_1}{2(k + \log k + \log \epsilon^{-1})} \Big)^{\frac{1}{1+\beta}}.$$

Choosing e to be a function of re via e = 1/re 2 , we can use Assumption lEl and the sandwich 
inequality (f36l) to find that there is constant C — dependent on K\ and /3 — such that 

Wfe). -(**))] - (n- + ( fc + lQgfc n +2lQgW ) ^ 

< ^,n(/) (37) 

< [<p(J(q), s(Y 1:k ))] + C'B n ii^ (A; + log k + 2 log re)"^ 
+ C B n n l +? (A; + log fc + 2 log re) . 



The two-sided bound (|37|) implies the proposition. 
B.4 Proof of Lemma 7

Fix $\epsilon > 0$ and let $f_n^i \in \mathcal{F}_n^i$ be an arbitrary member of each of the sets $\mathcal{F}_n^i$. Then



sup 

fer n 



max sup 

ie[N(e,n)] /eJ ri 



< max 

ie[N(e,n)] 



^(/)-^p ? E 3 [^(/(g), S (y 1:fcn ))] 

Q 

J> g (E g b(/(g), s(yi :fcB ))] - E g [^(/*(g), 



+ max sup 

te[JV(e,n)] /6 jri 



+ max sup 

i£[N{e,n)} /gJ rt 

by the triangle inequality. Applying Assumption E and the fact that $\mathcal{F}_n^i$ has radius $\epsilon$, the final two terms
are each bounded by $L_n \epsilon$, which implies the bound



max 

ie[AT(e,ra)] 



i 

Noting that $f_n^i$ was arbitrary, we can strengthen the bound (38) to

^(/)-^PA[^(/(g), S (F 1:fe J)] 



(38) 



sup 



< max inf 

iG[JV(e,n)] /e^ 



+ 2L n e. 



Applying Proposition [5] completes the proof. 
B.5 Proof of Lemma 8

Without loss of generality, let j = 1, and fix the query judgment pairs (Qi,Y±) = (q,Y) and 
(Q'^Yl) = (q',Y') with Q\ := Q { for all i £ {2,...,n}. Recall the notation B(q) from Section gj 
and for any r G Q, let £>'(r) := {i € [n] | — r } an d define nj. = |B'(r)|. In addition, let Yi r / lk 
denote Y h ,... ,Y ik . 
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We focus first on the case in which q ^ q'. Note that by definition ([19 
nF((q, Y), (Q 2 , Y 2 ),..., (Q n , Y n )) - nF((q', Y'), (Q 2 , Y 2 ), (Q n ,Y n )) 



n. 



-i 



v(f(Q),s(Yi lHk ))-(n q -l) 



n g - 1 



(39a) 



rtf(q'),s(Y tl .. lk ))-(n g/ + V 



n q , + 1 
k 



%-eB'( ? ) 
l 



£ (39b) 



We first bound the term (|39ap . When n q < k, this term becomes 

n q (p(f(q),s(Y ii:iri(i )) - (n q -l)<p(f(q),s(Y i2 . inq )) 

where ix,...,i^ denote the distinct elements of B(q). By assumption, each of the terms if is 
bounded by B, so this difference is at most 2n q B < 2kB. 
When n q > k, we note that the term (j39aH equals 



n B 



n, 



(n q - 1) 



n q -l 



*1 • 

ij€B>{q) 



n, 



<p(f(q),s(Y,Y lr . lk )). 



(40a) 
(40b) 



Since there are ( n<7 fc x ) terms in the summation (|40ap and Cl 9 ^ 1 ) terms in the summation (|40bp . 
and for any m, k E N with m > k, 



m 



(m — 1) 



m — 1 

it 



fc!(m-fc)! , ,(m-l-fc)!jfe! 
(m — 1)- 



(m- 1)! 



(m - 1)! 



k 



m — 1 



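we use the fact that $|\varphi| \le B$ to conclude that

$$\left| n_q \binom{n_q}{k}^{-1} \sum_{\substack{i_1 < \cdots < i_k \\ i_j \in B(q)}} \varphi\big(f(q), s(Y_{i_1:i_k})\big) - (n_q - 1) \binom{n_q - 1}{k}^{-1} \sum_{\substack{i_1 < \cdots < i_k \\ i_j \in B'(q)}} \varphi\big(f(q), s(Y_{i_1:i_k})\big) \right|$$

$$\le |k - 1| B + n_q \binom{n_q}{k}^{-1} \binom{n_q - 1}{k - 1} B = |k - 1| B + n_q \frac{k}{n_q} B \le 2kB.$$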



Hence, the term (39a) can be bounded by $2kB$ for any choice of $n_q$. Bounding the term (39b)
requires an entirely similar combinatorial argument and again yields the bound $2kB$, for a total
bound of $4kB$.






All that remains is to control the difference of the functions $F$ when $q = q'$. When $n_q \le k$, this
difference is given by

$$\frac{n_q}{n} \Big| \varphi\big(f(q), s(Y_{i_1:i_{n_q}})\big) - \varphi\big(f(q), s(Y', Y_{i_2:i_{n_q}})\big) \Big| \le \frac{2kB}{n}.$$



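When $n_q > k$, we have

$$n F\big((q, Y), (Q_2, Y_2), \ldots, (Q_n, Y_n)\big) - n F\big((q, Y'), (Q_2, Y_2), \ldots, (Q_n, Y_n)\big)$$

$$= n_q \binom{n_q}{k}^{-1} \sum_{\substack{i_1 < \cdots < i_k \\ i_j \in B(q)}} \Big[ \varphi\big(f(q), s(Y_{i_1:i_k})\big) - \varphi\big(f(q), s(Y'_{i_1:i_k})\big) \Big]$$

$$= n_q \binom{n_q}{k}^{-1} \sum_{\substack{1 = i_1 < \cdots < i_k \\ i_j \in B(q)}} \Big[ \varphi\big(f(q), s(Y, Y_{i_2:i_k})\big) - \varphi\big(f(q), s(Y', Y_{i_2:i_k})\big) \Big], \tag{41}$$

where $Y'_{i_1:i_k}$ denotes the tuple $Y_{i_1:i_k}$ with $Y_1$ replaced by $Y'$.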



There are $\binom{n_q - 1}{k - 1}$ terms in the final summation, and since

$$m \binom{m}{k}^{-1} \binom{m - 1}{k - 1} = m \cdot \frac{k!\,(m - k)!}{m!} \cdot \frac{(m - 1)!}{(k - 1)!\,(m - k)!} = k,$$

the difference (41) is bounded by $2kB$.
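The two binomial identities used in this proof can be checked exactly for small values. The following is a minimal sketch, not part of the original argument; the function names `weight_gap`, `new_tuple_mass`, and `check_identities` are hypothetical, and the bound is expressed in units of $B$.

```python
from fractions import Fraction
from math import comb


def weight_gap(m: int, k: int) -> Fraction:
    """Exact value of m / C(m, k) - (m - 1) / C(m - 1, k), for m > k >= 1."""
    return Fraction(m, comb(m, k)) - Fraction(m - 1, comb(m - 1, k))


def new_tuple_mass(m: int, k: int) -> Fraction:
    """Exact value of m * C(m - 1, k - 1) / C(m, k): the total weight placed
    on the k-tuples that contain one fixed index."""
    return Fraction(m * comb(m - 1, k - 1), comb(m, k))


def check_identities(max_m: int = 12) -> bool:
    """Verify both identities, and the resulting 2k bound, for all m <= max_m."""
    for m in range(2, max_m + 1):
        for k in range(1, m):  # the proof assumes m > k
            # Identity used for the terms shared by both samples (40a).
            assert weight_gap(m, k) == Fraction(1 - k, comb(m - 1, k))
            # Identity used for the terms containing the replaced index (40b), (41).
            assert new_tuple_mass(m, k) == k
            # Bound in units of B: |k - 1| + k <= 2k.
            shared = abs(weight_gap(m, k)) * comb(m - 1, k)
            assert shared + new_tuple_mass(m, k) <= 2 * k
    return True
```

For instance, `weight_gap(5, 2)` evaluates to $-1/6 = (1 - 2)/\binom{4}{2}$, matching the first identity.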



B.6 Proof of Proposition 6

Let $\epsilon' > 0$ be arbitrary. Use Assumption F and partition $\mathcal{F}_n$ into $N = N(\epsilon', n) < \infty$ subsets
$\mathcal{F}_n^1, \ldots, \mathcal{F}_n^N$, and fix $f^i \in \mathcal{F}_n^i$ so that for any $f \in \mathcal{F}_n^i$ we have $\|f - f^i\| \le \epsilon'$. Then for any index
$i \in \{1, \ldots, N(\epsilon', n)\}$, we see that



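$$\sup_{f \in \mathcal{F}_n^i} \big| \widehat{R}_{\varphi,n}(f) - R_{\varphi,n}(f) \big| = \sup_{f \in \mathcal{F}_n^i} \big| \widehat{R}_{\varphi,n}(f) - \widehat{R}_{\varphi,n}(f^i) + \widehat{R}_{\varphi,n}(f^i) - R_{\varphi,n}(f^i) + R_{\varphi,n}(f^i) - R_{\varphi,n}(f) \big|$$

$$\le \sup_{f \in \mathcal{F}_n^i} \big| \widehat{R}_{\varphi,n}(f) - \widehat{R}_{\varphi,n}(f^i) \big| + \big| \widehat{R}_{\varphi,n}(f^i) - R_{\varphi,n}(f^i) \big| + \sup_{f \in \mathcal{F}_n^i} \big| R_{\varphi,n}(f^i) - R_{\varphi,n}(f) \big|.$$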

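Assumption E guarantees that $|\widehat{R}_{\varphi,n}(f) - \widehat{R}_{\varphi,n}(f^i)| \le L_n \|f - f^i\|$, so

$$\sup_{f \in \mathcal{F}_n^i} \big| \widehat{R}_{\varphi,n}(f) - R_{\varphi,n}(f) \big| \le L_n \epsilon' + \big| \widehat{R}_{\varphi,n}(f^i) - R_{\varphi,n}(f^i) \big| + \sup_{f \in \mathcal{F}_n^i} \big| R_{\varphi,n}(f^i) - R_{\varphi,n}(f) \big|.$$

The same argument applies to the difference $R_{\varphi,n}(f) - R_{\varphi,n}(f^i)$. Thus we use the fact that the
classes $\mathcal{F}_n^i$ partition $\mathcal{F}_n$, that is, $\bigcup_{i=1}^{N} \mathcal{F}_n^i \supseteq \mathcal{F}_n$, and apply a union bound and the triangle inequality
to find that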



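$$\mathbb{P}\Big( \sup_{f \in \mathcal{F}_n} \big| \widehat{R}_{\varphi,n}(f) - R_{\varphi,n}(f) \big| \ge \epsilon \Big) = \mathbb{P}\Big( \max_{i \in \{1, \ldots, N(\epsilon', n)\}} \sup_{f \in \mathcal{F}_n^i} \big| \widehat{R}_{\varphi,n}(f) - R_{\varphi,n}(f) \big| \ge \epsilon \Big)$$

$$\le \sum_{i=1}^{N(\epsilon', n)} \mathbb{P}\Big( \sup_{f \in \mathcal{F}_n^i} \big| \widehat{R}_{\varphi,n}(f) - R_{\varphi,n}(f) \big| \ge \epsilon \Big) \le \sum_{i=1}^{N(\epsilon', n)} \mathbb{P}\Big( \big| \widehat{R}_{\varphi,n}(f^i) - R_{\varphi,n}(f^i) \big| \ge \epsilon - 2 L_n \epsilon' \Big).$$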



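Choosing $\epsilon' = \epsilon/(4L_n)$, we can apply the concentration inequality (33) to conclude that

$$\mathbb{P}\Big( \sup_{f \in \mathcal{F}_n} \big| \widehat{R}_{\varphi,n}(f) - R_{\varphi,n}(f) \big| \ge \epsilon \Big) \le 2 \exp\left( \log N\Big( \frac{\epsilon}{4L_n}, n \Big) - \frac{n \epsilon^2}{32 k^2 B^2} \right).$$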

This implies the statement of the proposition. 
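To get a feel for the scale of this bound, the sketch below evaluates the right-hand side numerically. It is illustrative only: the helper names are hypothetical, and `log_cover` stands in for $\log N(\epsilon/(4L_n), n)$, which is treated here as a fixed constant even though in the proposition it may depend on $n$.

```python
import math


def uniform_deviation_bound(n, eps, k, B, log_cover):
    """Evaluate 2 * exp(log_cover - n * eps^2 / (32 k^2 B^2)), capped at 1.

    log_cover is an assumed stand-in for log N(eps / (4 L_n), n).
    """
    exponent = log_cover - n * eps ** 2 / (32 * k ** 2 * B ** 2)
    if exponent >= 0:
        return 1.0  # the bound is vacuous here; also avoids overflow in exp
    return min(1.0, 2.0 * math.exp(exponent))


def samples_for_confidence(eps, k, B, log_cover, delta):
    """Smallest n for which the bound above is at most delta,
    under the simplifying assumption that log_cover does not grow with n."""
    needed = 32 * k ** 2 * B ** 2 * (log_cover + math.log(2.0 / delta)) / eps ** 2
    return math.ceil(needed)
```

Under these assumptions, the required sample size grows like $k^2 B^2 / \epsilon^2$, times the log covering number plus $\log(2/\delta)$.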



C Proofs of Inconsistency Results

The proofs in this section are essentially present in the preliminary version of this work. We
present and extend them here for the convenience of the reader.

C.1 Proof of Proposition 3



This proposition is a consequence of the fact that the feedback arc set problem [29] is NP-complete.
In the feedback arc set problem, we are given a directed graph $G = (V, E)$ and an integer $k$ and
need to determine whether there is a subset $E' \subseteq E$ with $|E'| \le k$ such that $E'$ contains at least
one edge from every directed cycle in $G$ (equivalently, whether $G' = (V, E \setminus E')$ is a directed acyclic
graph (DAG)).

Now consider the problem of deciding whether there exists an $\alpha$ with $\ell(\alpha, \mu) \le k$, and let
$G_\mu$ denote the graph over the nodes associated with adjacency matrices $Y$, where $G_\mu$ has edge
weights equal to the average $Y^\mu_{ij} = \int Y_{ij} \, d\mu(Y)$. Since $\alpha$ induces an order of the nodes in this
"expected" graph $G_\mu$, this is equivalent to finding an ordering of the nodes $i_1, \ldots, i_n$ (denoted
$i_1 \prec i_2 \prec \cdots \prec i_n$) in $G_\mu$ such that the sum of the back edges is at most $k$,

$$\sum_{i \succ j} Y^\mu_{ij} \le k.$$

Removing all the back edges (edges $(i \to j)$ in the expected graph $G_\mu$ such that $i \succ j$ in the original
ordering) leaves a DAG. Now, given a graph $G = (V, E)$, we can construct the expected graph $G_\mu$
directly from $G$ with weights $Y^\mu_{ij} = 1$ if $(i \to j) \in E$ and $0$ otherwise (set the probability that edge
$(i \to j)$ appears to be $1/|E|$ and set the associated $ij$th adjacency matrix $Y$ to be $Y = 0$ except
that $Y_{ij} = |E|$). Then there is an $\alpha$ such that $\ell(\alpha, \mu) \le k$ if and only if there is a feedback arc set
$E'$ with $|E'| \le k$.
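The reduction can be illustrated by brute force on a small graph. The sketch below is not an efficient algorithm (the problem is NP-complete) and all function names are hypothetical. It builds the expected weights as described above, then compares the minimum back-edge weight over all orderings with the size of a smallest feedback arc set found by exhaustive search.

```python
from fractions import Fraction
from itertools import combinations, permutations


def expected_weights(edges):
    """Each edge is drawn with probability 1/|E| and carries weight |E|,
    so every edge of G has expected weight exactly 1."""
    m = len(edges)
    return {e: Fraction(1, m) * m for e in edges}


def min_back_edge_weight(nodes, weights):
    """Minimum over orderings of the total weight of edges (i -> j)
    whose tail i is placed after its head j."""
    best = None
    for order in permutations(nodes):
        pos = {v: r for r, v in enumerate(order)}
        cost = sum(w for (i, j), w in weights.items() if pos[i] > pos[j])
        if best is None or cost < best:
            best = cost
    return best


def is_acyclic(nodes, edges):
    """Kahn-style check: repeatedly remove nodes with no incoming edges."""
    indeg = {v: 0 for v in nodes}
    for _, j in edges:
        indeg[j] += 1
    stack = [v for v in nodes if indeg[v] == 0]
    seen = 0
    while stack:
        v = stack.pop()
        seen += 1
        for i, j in edges:
            if i == v:
                indeg[j] -= 1
                if indeg[j] == 0:
                    stack.append(j)
    return seen == len(nodes)


def min_feedback_arc_set_size(nodes, edges):
    """Smallest number of edges whose removal leaves a DAG (exhaustive search)."""
    for size in range(len(edges) + 1):
        for removed in combinations(edges, size):
            kept = [e for e in edges if e not in removed]
            if is_acyclic(nodes, kept):
                return size
```

On the graph with edges $1 \to 2$, $2 \to 3$, $3 \to 1$, $3 \to 4$, $4 \to 2$, both quantities equal 1, since removing $2 \to 3$ breaks both cycles.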
C.2 Proof of Theorem 2

Before proceeding to the proof of Theorem 2 proper, we make a few remarks to reduce the space of
functions $\phi$ that we must consider; we also argue that $h$ must be continuous and strictly increasing.

First, if the surrogate $\varphi$ from the definition (12) is structure-consistent, it must be classification
calibrated: $\phi$ must be differentiable at $0$ with $\phi'(0) < 0$. This is a consequence of Bartlett, Jordan,
and McAuliffe's analysis of binary classification and the correspondence between binary classification
and pairwise ranking. The same reasoning implies that we must have $h \ge 0$ on $\mathbb{R}_+$ and $h$
strictly increasing. We now argue that $h$ must be continuous. Assume for the sake of contradiction
that $h$ is discontinuous at some $c > 0$, and without loss of generality assume that $h$ is not right
continuous at $c$. Then there exists an $\epsilon > 0$ such that $h(c + \delta) \ge h(c) + \epsilon$ for all $\delta > 0$. Fix $\delta$ and
$\gamma$ such that

$$\gamma < \frac{\epsilon}{4h(c) + 2\epsilon} \quad \text{and} \quad \delta < \frac{4\gamma c}{1 - 2\gamma}.$$

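Consider the case in which only two items are to be ranked, 1 and 2, and let the preference $Y_{21} = c + \delta$
appear with probability $\frac{1}{2} - \gamma$ and the preference $Y_{12} = c$ appear with probability $\frac{1}{2} + \gamma$. In this
case,

$$(1 - 2\gamma)\delta < 4\gamma c, \quad \text{or} \quad \Big(\frac{1}{2} - \gamma\Big)\delta - \gamma c < \gamma c, \quad \text{so} \quad \Big(\frac{1}{2} - \gamma\Big)(c + \delta) < \Big(\frac{1}{2} + \gamma\Big) c.$$

Any correct score assignment must thus satisfy $\alpha_1 > \alpha_2$. On the other hand, the condition on $\gamma$, $\epsilon$,
and $h(c)$ implies $2\gamma h(c) < (\frac{1}{2} - \gamma)\epsilon$, so

$$\Big(\frac{1}{2} + \gamma\Big) h(c) < \Big(\frac{1}{2} - \gamma\Big)(h(c) + \epsilon) \le \Big(\frac{1}{2} - \gamma\Big) h(c + \delta)$$

by assumption on the discontinuity of $h$. In particular, any minimizer of $(\frac{1}{2} + \gamma) h(c) \phi(\alpha_1 - \alpha_2) +
(\frac{1}{2} - \gamma) h(c + \delta) \phi(\alpha_2 - \alpha_1)$ will in this case set $\alpha_2 > \alpha_1$, since $\phi$ must be classification calibrated. In
this case, the discontinuity of $h$ forces mis-ordering, and hence $h$ must be continuous.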
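The two-item construction can be made concrete with a small numerical sketch. Everything specific here is an assumption chosen for illustration: the transform `h` (identity with a jump of size 0.5 just to the right of $c = 1$), the choice of $\gamma$ and $\delta$ as half of their upper limits, and the logistic loss $\phi(x) = \log(1 + e^{-x})$, for which the minimizer of $h_{12}\phi(d) + h_{21}\phi(-d)$ over $d = \alpha_1 - \alpha_2$ has the closed form $d^* = \log(h_{12}/h_{21})$.

```python
import math

C = 1.0      # point of discontinuity (hypothetical)
JUMP = 0.5   # jump size, playing the role of epsilon (hypothetical)


def h(y):
    """Increasing transform that is not right-continuous at C:
    h(C + d) >= h(C) + JUMP for every d > 0."""
    return y if y <= C else y + JUMP


def two_item_example():
    # Pick gamma and delta strictly inside the ranges required by the proof.
    gamma = 0.5 * JUMP / (4 * h(C) + 2 * JUMP)
    delta = 0.5 * 4 * gamma * C / (1 - 2 * gamma)
    p12, p21 = 0.5 + gamma, 0.5 - gamma   # P(Y_12 = c), P(Y_21 = c + delta)

    # Expected true penalties: positive gap means item 1 should rank first.
    true_gap = p12 * C - p21 * (C + delta)

    # Surrogate weights after applying h.
    h12, h21 = p12 * h(C), p21 * h(C + delta)

    # Logistic-loss minimizer of h12 * phi(d) + h21 * phi(-d).
    d_star = math.log(h12 / h21)
    return true_gap, h12, h21, d_star
```

With these choices the true gap is positive (item 1 should be ranked first) while $d^* < 0$, so the surrogate minimizer ranks item 2 first, as the argument predicts.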



The desired properties of $h$ established, consider the recession function [25] of $\phi$,

$$\phi'_\infty(d) := \sup_{t > 0} \frac{\phi(td) - \phi(0)}{t} = \lim_{t \to \infty} \frac{\phi(td) - \phi(0)}{t}.$$

For any $\phi$ bounded below with $\phi'(0) < 0$, we have that $\phi'_\infty(1) \ge 0$. Throughout this analysis,
we require the weaker assumption that $\phi$ decreases more slowly in the positive direction than the
negative:

$$\phi'_\infty(1) \ge 0 \quad \text{or} \quad \phi'_\infty(-1) = \infty. \tag{42}$$

Consider two DAGs on nodes 1, 2, and 3 that induce only the four penalty values $Y_{12}$, $Y_{13}$,
$Y_{23}$, and $Y_{31}$ (recall the figure illustrating these graphs). In this case, if $Y_{13} > Y_{31}$, any $\alpha$ minimizing $\ell(\alpha, \mu)$ must satisfy
$\alpha_1 > \alpha_2 > \alpha_3$. We now show under some very general conditions that if $\varphi$ is edge-consistent, $\phi$ is
non-convex.

Let $\phi'(x)$ denote an element of the subgradient set $\partial\phi(x)$ and recall the definition $h_{ij} =
\int h(Y_{ij}) \, d\mu$. The subgradient conditions for optimality of

$$\ell_\varphi(\alpha, \mu) = h_{12}\phi(\alpha_1 - \alpha_2) + h_{13}\phi(\alpha_1 - \alpha_3) + h_{23}\phi(\alpha_2 - \alpha_3) + h_{31}\phi(\alpha_3 - \alpha_1) \tag{43}$$

are that

$$0 = h_{12}\phi'(\alpha_1 - \alpha_2) + h_{13}\phi'(\alpha_1 - \alpha_3) - h_{31}\phi'(\alpha_3 - \alpha_1),$$

$$0 = -h_{12}\phi'(\alpha_1 - \alpha_2) + h_{23}\phi'(\alpha_2 - \alpha_3). \tag{44}$$



We begin by showing that whenever the condition (42) holds for $\phi$, there is a finite minimizer of
$\ell_\varphi(\alpha, \mu)$ as defined by (15). The lemma is technical, so we provide its proof in Appendix C.4.

Lemma 10. Let the condition (42) hold and $\phi'(0) < 0$. There is a constant $C < \infty$ and a vector
$\alpha^*$ minimizing $\ell_\varphi(\alpha, \mu)$ with $\|\alpha^*\|_\infty \le C$.

The next lemma provides the essential lever we use to prove the theorem; we provide its proof in
Appendix C.5.

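Lemma 11 (Inconsistency of convex losses). Suppose that $Y_{13} > Y_{31} > 0$, $Y_{12} > 0$, $Y_{23} > 0$. Let

$$\ell(\alpha, \mu) = Y_{12} 1(\alpha_1 \le \alpha_2) + Y_{13} 1(\alpha_1 \le \alpha_3) + Y_{23} 1(\alpha_2 \le \alpha_3) + Y_{31} 1(\alpha_3 \le \alpha_1)$$

and $\ell_\varphi(\alpha, \mu)$ be defined as in Eq. (43). For convex $\phi$ with $\phi'(0) < 0$, we have

$$\ell^*_\varphi(\mu) = \inf_\alpha \Big\{ \ell_\varphi(\alpha, \mu) \;\Big|\; \alpha \notin \operatorname*{argmin}_{\alpha'} \ell(\alpha', \mu) \Big\}$$

whenever either of the following conditions is satisfied:

$$\text{Condition 1: } h_{23} < \frac{h_{31} h_{12}}{h_{13} + h_{12}} \quad \text{or} \quad \text{Condition 2: } h_{12} < \frac{h_{31} h_{23}}{h_{13} + h_{23}}.$$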

Lemma 11 allows us to construct scenarios under which arbitrary pairwise surrogate losses
with convex $\phi$ are inconsistent. Indeed, assume for the sake of contradiction that $\varphi$ is structure-consistent.
Recall that for $\phi$ convex, $\phi'(0) < 0$ by classification calibration. We will construct graphs
$G_1$ and $G_2$ (with associated adjacency matrices $Y^{G_1}$ and $Y^{G_2}$) so that the resulting expected loss
satisfies Condition 1 of Lemma 11 while additionally satisfying the low-noise condition.
Consider the following two graphs:

$$G_1 = (\{1, 2, 3\}, \{(1 \to 2), (1 \to 3)\}), \qquad G_2 = (\{1, 2, 3\}, \{(2 \to 3), (3 \to 1)\}).$$

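Fix any weights $Y^{G_1}_{12}$, $Y^{G_1}_{13}$, $Y^{G_2}_{31}$ with $Y^{G_1}_{13} > Y^{G_1}_{12} > 0$ and $Y^{G_1}_{13} > Y^{G_2}_{31} > 0$, and let $\mu$ place half its
mass on $G_1$ (or $Y^{G_1}$) and half its mass on $G_2$. As $h$ is continuous with $h(0) = 0$, there exists some
$\epsilon > 0$ such that $h(\epsilon) < 2h_{31}h_{12}/(h_{13} + h_{12})$, where $h_{ij} = \frac{1}{2} h(Y^{G_1}_{ij}) + \frac{1}{2} h(Y^{G_2}_{ij})$ as in the definition of
the surrogate risk $\ell_\varphi$. Take $Y^{G_2}_{23} = \min\{\epsilon, (Y^{G_1}_{13} - Y^{G_1}_{12})/2\}$. Then we have

$$h_{23} = \frac{1}{2} h(Y^{G_2}_{23}) \le \frac{h(\epsilon)}{2} < \frac{h_{31} h_{12}}{h_{13} + h_{12}}.$$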

Hence Condition 1 of Lemma 11 is satisfied, so $\varphi$ is not edge-consistent. Moreover, the fact that

$$Y^{G_2}_{23} \le \frac{Y^{G_1}_{13} - Y^{G_1}_{12}}{2} = \frac{1}{2} Y^{G_1}_{13} - \frac{1}{2} Y^{G_1}_{12}$$

implies that the expected difference graph $G_\mu$ satisfies the low-noise condition.

C.3 Proof of Theorem O 

Assume for the sake of contradiction that $\varphi$ is structure-consistent. As in the proof of Theorem 2,
the function $h$ must be strictly increasing, and we let $c \in \mathbb{R}$ belong to the image $h((0, \infty))$ and define
the shifted function $\tilde{\phi}(x) = \phi(x - c)$. By the reduction to the binary case, $\tilde{\phi}$ must be classification
calibrated, whence $\tilde{\phi}$ must be differentiable at $0$ and satisfy $\tilde{\phi}'(0) < 0$. Using a technique similar to
that in our proof of Theorem 2, we now construct a setting of four graphs and provide probabilities
of appearance satisfying the conditions in Lemma 11 and the low-noise condition.

Consider the following four graphs on nodes $\{1, 2, 3\}$, each with one edge:

$$G_1 = (\{1, 2, 3\}, \{(1 \to 2)\}), \qquad G_2 = (\{1, 2, 3\}, \{(2 \to 3)\}),$$

$$G_3 = (\{1, 2, 3\}, \{(1 \to 3)\}), \qquad G_4 = (\{1, 2, 3\}, \{(3 \to 1)\}).$$

Choose constant edge weights $Y^{G_1}_{12} = Y^{G_2}_{23} = Y^{G_3}_{13} = Y^{G_4}_{31} = h^{-1}(c) > 0$ ($h$ is increasing and thus
invertible), and set the probabilities of appearance to be $\mu = (.25, .01, .5, .24)$. Recalling $\tilde{\phi}$, we have

$$\ell_\varphi(\alpha, \mu) = \mu(G_1)\tilde{\phi}(\alpha_1 - \alpha_2) + \mu(G_2)\tilde{\phi}(\alpha_2 - \alpha_3) + \mu(G_3)\tilde{\phi}(\alpha_1 - \alpha_3) + \mu(G_4)\tilde{\phi}(\alpha_3 - \alpha_1).$$

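Notably, $\tilde{\phi}$ is convex and satisfies the recession condition (42). Moreover

$$Y^\mu_{13} - Y^\mu_{31} = h^{-1}(c)\big(\mu(G_3) - \mu(G_4)\big) \ge h^{-1}(c)\big(\mu(G_1) + \mu(G_2)\big) = Y^\mu_{12} + Y^\mu_{23} > 0,$$

so $G_\mu$ is a DAG satisfying the low-noise condition. However, the probabilities $\mu$ satisfy

$$\mu(G_2) = .01 < .08 = \frac{\mu(G_4)\mu(G_1)}{\mu(G_3) + \mu(G_1)},$$

which is Condition 1 of Lemma 11 with $h_{12}, h_{23}, h_{13}, h_{31}$ replaced by $\mu(G_1), \mu(G_2), \mu(G_3), \mu(G_4)$.
Hence, by Lemma 11 we have the contradiction that

$$\ell^*_\varphi(\mu) = \inf_\alpha \Big\{ \ell_\varphi(\alpha, \mu) \;\Big|\; \alpha \notin \operatorname*{argmin}_{\alpha'} \ell(\alpha', \mu) \Big\}.$$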
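The numbers in this construction can be checked directly, and the mis-ordering can be observed for one concrete convex loss. The sketch below is illustrative only: it takes $\tilde{\phi}$ to be the logistic loss $\log(1 + e^{-x})$ (an assumption, not the general $\tilde{\phi}$ of the proof), uses plain gradient descent with hypothetical step size and iteration count, and measures the true loss in units of $h^{-1}(c)$.

```python
import math

# Probabilities of the four single-edge graphs: 1->2, 2->3, 1->3, 3->1.
MU = {"G1": 0.25, "G2": 0.01, "G3": 0.50, "G4": 0.24}


def low_noise_holds(mu, tol=1e-12):
    """Check mu(G3) - mu(G4) >= mu(G1) + mu(G2), up to rounding."""
    return mu["G3"] - mu["G4"] >= mu["G1"] + mu["G2"] - tol


def condition_1_holds(mu):
    """Condition 1 of Lemma 11 with h_ij replaced by the graph probabilities."""
    h12, h23, h13, h31 = mu["G1"], mu["G2"], mu["G3"], mu["G4"]
    return h23 < h31 * h12 / (h13 + h12)


def dphi(x):
    """Derivative of the logistic loss log(1 + exp(-x))."""
    return -1.0 / (1.0 + math.exp(x))


def minimize_surrogate(mu, steps=20000, lr=1.0):
    """Gradient descent on the four-term surrogate; returns (alpha, last gradient)."""
    h12, h23, h13, h31 = mu["G1"], mu["G2"], mu["G3"], mu["G4"]
    a = [0.0, 0.0, 0.0]
    grad = [0.0, 0.0, 0.0]
    for _ in range(steps):
        g12, g13 = dphi(a[0] - a[1]), dphi(a[0] - a[2])
        g23, g31 = dphi(a[1] - a[2]), dphi(a[2] - a[0])
        grad = [h12 * g12 + h13 * g13 - h31 * g31,
                -h12 * g12 + h23 * g23,
                -h13 * g13 - h23 * g23 + h31 * g31]
        a = [ai - lr * gi for ai, gi in zip(a, grad)]
    return a, grad


def true_loss(a, mu):
    """Expected pairwise disagreement loss, in units of h^{-1}(c)."""
    return (mu["G1"] * (a[0] <= a[1]) + mu["G2"] * (a[1] <= a[2])
            + mu["G3"] * (a[0] <= a[2]) + mu["G4"] * (a[2] <= a[0]))
```

In this instance the surrogate minimizer orders the items as $\alpha_1 > \alpha_3 > \alpha_2$, whose true loss ($.25$) exceeds that of the correct ordering $\alpha_1 > \alpha_2 > \alpha_3$ ($.24$), even though the low-noise condition holds.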



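C.4 Proof of Lemma 10

Let $\{\alpha^{(n)}\}_{n=1}^\infty \subset \mathbb{R}^3$ be a sequence satisfying $\ell_\varphi(\alpha^{(n)}, \mu) \to \ell^*_\varphi(\mu)$. Suppose for the sake of contradiction
that $\limsup_n (\alpha_i^{(n)} - \alpha_j^{(n)}) = \infty$ for some $i, j$. The convexity of $\phi$ coupled with $\phi'(0) < 0$
implies that $\limsup_n \phi(\alpha_j^{(n)} - \alpha_i^{(n)}) = \infty$, and the recession condition (42), that $\phi'_\infty(1) \ge 0$ or
$\phi'_\infty(-1) = \infty$, guarantees $\limsup_n \ell_\varphi(\alpha^{(n)}, \mu) = \infty$ whenever $i > j$ or $i = 1$ and $j = 3$. We thus
have two remaining cases:

$$\text{(a)} \; \limsup_n \big(\alpha_1^{(n)} - \alpha_2^{(n)}\big) = \infty \quad \text{or} \quad \text{(b)} \; \limsup_n \big(\alpha_2^{(n)} - \alpha_3^{(n)}\big) = \infty.$$

In case (a), we note that $\alpha_1^{(n)} - \alpha_2^{(n)} = \alpha_1^{(n)} - \alpha_3^{(n)} + \alpha_3^{(n)} - \alpha_2^{(n)}$, but there must exist a constant $C$
such that $|\alpha_1^{(n)} - \alpha_3^{(n)}| \le C$ for all $n$ by our earlier argument. This would imply that $\limsup_n (\alpha_3^{(n)} -
\alpha_2^{(n)}) = \infty$, a contradiction. Similarly, for case (b), we may use $\alpha_2^{(n)} - \alpha_3^{(n)} = \alpha_2^{(n)} - \alpha_1^{(n)} + \alpha_1^{(n)} - \alpha_3^{(n)}$
and our earlier argument that $|\alpha_1^{(n)} - \alpha_3^{(n)}| \le C$ for all $n$ to see that case (b) would require
$\limsup_n (\alpha_2^{(n)} - \alpha_1^{(n)}) = \infty$, another contradiction.

As a consequence, there must be some $C < \infty$ with $|\alpha_i^{(n)} - \alpha_j^{(n)}| \le C$ for all $i, j, n$. The
conditional surrogate risk $\ell_\varphi(\alpha, \mu)$ is shift invariant with respect to $\alpha$, so without loss of generality
we may assume that one coordinate of $\alpha^{(n)}$ equals $0$, and thus $\|\alpha^{(n)}\|_\infty \le C$. Convex functions are continuous on compact
domains [25, Chapter IV.3], and thus some $\alpha^*$ with $\|\alpha^*\|_\infty \le C$ attains the infimum

$$\inf_{\|\alpha\|_\infty \le C} \ell_\varphi(\alpha, \mu) = \ell_\varphi(\alpha^*, \mu).$$
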
C.5 Proof of Lemma 11

Lemma 10 shows that the optimal $\ell^*_\varphi(\mu)$ is attained by some finite $\alpha$. Thus, we fix an $\alpha^*$ satisfying
the optimality condition (44) and let $\delta_{ij} = \alpha_i^* - \alpha_j^*$ and $g_{ij} = \phi'(\delta_{ij})$ for $i \neq j$. We make use of
the monotonicity of subgradients, that is, $\delta_{ij} > \delta_{kl}$ implies $g_{ij} \ge g_{kl}$ [e.g., 25, Chapter VI]. By
condition (44),

$$g_{13} - g_{12} = \frac{h_{31}}{h_{13}} g_{31} - \Big(1 + \frac{h_{12}}{h_{13}}\Big) g_{12}, \tag{45a}$$

$$g_{13} - g_{23} = \frac{h_{31}}{h_{13}} g_{31} - \Big(1 + \frac{h_{23}}{h_{13}}\Big) g_{23}. \tag{45b}$$
"13 V "13/ 

Suppose for the sake of contradiction that $\alpha^* \in \operatorname{argmin}_\alpha \ell(\alpha, \mu)$. As $\delta_{13} = \delta_{12} + \delta_{23}$, we have
that $\delta_{13} > \delta_{12}$ and $\delta_{13} > \delta_{23}$. The convexity of $\phi$ implies that if $\delta_{13} > \delta_{12}$, then $g_{13} \ge g_{12}$. If
$g_{12} \ge 0$, we thus have that $g_{13} \ge 0$ and, by (44), $g_{31} \ge 0$. This is a contradiction since $\delta_{31} < 0$ gives
$g_{31} \le \phi'(0) < 0$. Hence, $g_{12} < 0$. By identical reasoning, we also have that $g_{23} < 0$.

Now, $\delta_{23} > 0 > \delta_{31}$ implies that $g_{23} \ge g_{31}$, which combined with the equality (45a) and the fact
that $g_{23} = (h_{12}/h_{23}) g_{12}$ (by the first-order optimality equation (44)) gives

$$g_{13} - g_{12} \le \frac{h_{31}}{h_{13}} g_{23} - \Big(1 + \frac{h_{12}}{h_{13}}\Big) g_{12} = \Big(\frac{h_{31} h_{12}}{h_{23}} - h_{13} - h_{12}\Big) \frac{g_{12}}{h_{13}}.$$

Since $g_{12}/h_{13} < 0$, we have that $g_{13} - g_{12} < 0$ whenever $h_{31} h_{12}/h_{23} > h_{13} + h_{12}$. But when $\delta_{13} > \delta_{12}$,
we must have $g_{13} \ge g_{12}$, which yields a contradiction under Condition 1.

Similarly, $\delta_{12} > 0 > \delta_{31}$ implies that $g_{12} \ge g_{31}$, which with $g_{12} = (h_{23}/h_{12}) g_{23}$ and equality (45b)
yields

$$g_{13} - g_{23} \le \frac{h_{31}}{h_{13}} g_{12} - \Big(1 + \frac{h_{23}}{h_{13}}\Big) g_{23} = \Big(\frac{h_{31} h_{23}}{h_{12}} - h_{13} - h_{23}\Big) \frac{g_{23}}{h_{13}}.$$

Since $g_{23}/h_{13} < 0$, we further have that $g_{13} - g_{23} < 0$ whenever $h_{31} h_{23}/h_{12} > h_{13} + h_{23}$. This
contradicts $\delta_{13} > \delta_{23}$ under Condition 2.